Developing a Web Crawler with Python and the Requests Library

From news articles and e-commerce platforms to social media updates and blog posts, the web is a treasure trove of valuable data. However, manually navigating through countless web pages to gather this information is a time-consuming and tedious task. That's where web crawling comes in.

What is Web Crawling?

Web crawling, closely related to web scraping, is a technique for systematically browsing and extracting data from websites. It involves writing a script or program that automatically visits web pages and follows links (crawling), and gathers relevant data for further analysis (scraping). This process is essential for various applications, such as web indexing, data mining, and content aggregation.

Python, with its simplicity and versatility, has become one of the most popular programming languages for web crawling tasks. Its rich ecosystem of libraries and frameworks provides developers with powerful tools to build efficient and robust web crawlers. One such library is the requests library.

Python Requests Library

The requests library is a widely used Python library that simplifies the process of sending HTTP requests and interacting with web pages. It provides an intuitive interface for making requests to web servers and handling the responses.

With just a few lines of code, you can retrieve web content, extract data, and perform various operations on the retrieved information.
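To get a feel for what the library handles for you, the sketch below builds a GET request without sending it, so you can inspect the exact URL and headers requests would transmit. The query parameter and user-agent string are arbitrary examples chosen for illustration:

```python
import requests

# Build a GET request without sending it, to inspect what the
# requests library would actually transmit over the network
req = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"q": "python"},
    headers={"User-Agent": "demo-crawler/0.1"},
)
prepared = req.prepare()

print(prepared.url)                    # https://httpbin.org/get?q=python
print(prepared.headers["User-Agent"])  # demo-crawler/0.1
```

Note how the params dictionary is URL-encoded into the query string automatically; in everyday use, requests.get() does this preparation and sending in one call.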

Getting Started

To begin, let's ensure that we have the requests and beautifulsoup4 libraries installed. We can easily install them using pip, the Python package manager.

Open your terminal or command prompt and enter the following commands:

pip install requests
pip install beautifulsoup4

With both libraries installed, we are ready to start developing our web crawler.

Building a Web Crawler

Step 1: Importing the Required Libraries

To begin, we need to import the requests library, which will enable us to send HTTP requests and retrieve web page data. We will also import BeautifulSoup for HTML parsing and time for rate limiting:

import requests
from bs4 import BeautifulSoup
import time

Step 2: Sending a GET Request

The first step in web crawling is sending a GET request to a web page. We can use the requests library's get() function to retrieve the HTML content of a web page:

import requests

url = "https://httpbin.org/html"
response = requests.get(url)

print(f"Status Code: {response.status_code}")
print(f"Content Type: {response.headers.get('content-type')}")

Output

Status Code: 200
Content Type: text/html; charset=utf-8

Step 3: Parsing the HTML Content

Once we have the HTML content, we need to parse it to extract the relevant information. The BeautifulSoup library provides a convenient way to parse HTML and navigate through its elements:

import requests
from bs4 import BeautifulSoup

url = "https://httpbin.org/html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find the title of the page
title = soup.find("title")
print(f"Page Title: {title.text}")

Output

Page Title: Herman Melville - Moby-Dick

Step 4: Extracting Data

With the parsed HTML, we can now extract the desired data. This can involve locating specific elements, extracting text, retrieving attribute values, and more:

import requests
from bs4 import BeautifulSoup

url = "https://httpbin.org/html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all paragraphs
paragraphs = soup.find_all("p")
print(f"Found {len(paragraphs)} paragraphs")

# Extract text from first paragraph
if paragraphs:
    first_para = paragraphs[0].text.strip()
    print(f"First paragraph: {first_para[:100]}...")

Output

Found 3 paragraphs
First paragraph: Call me Ishmael. Some years ago...
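The step above extracts text; retrieving attribute values works much the same way. As a sketch on a small inline HTML fragment (the markup here is made up for illustration), attributes can be read by indexing a tag like a dictionary or by calling .get(), which returns None for missing attributes:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment for demonstration purposes
html = """
<div>
  <a href="/page1">Page 1</a>
  <a href="/page2">Page 2</a>
  <img src="logo.png" alt="Logo">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Index a tag to read an attribute value
for a in soup.find_all("a"):
    print(a["href"])          # /page1, then /page2

# .get() avoids a KeyError when an attribute might be absent
img = soup.find("img")
print(img.get("alt"))         # Logo
print(img.get("title"))      # None
```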

Step 5: Complete Web Crawler Example

Here's a complete example that demonstrates a simple web crawler with error handling and rate limiting:

import requests
from bs4 import BeautifulSoup
import time

def simple_crawler(url, max_pages=3):
    visited_urls = set()
    urls_to_visit = [url]
    
    for _ in range(max_pages):
        if not urls_to_visit:
            break
            
        current_url = urls_to_visit.pop(0)
        
        if current_url in visited_urls:
            continue
            
        try:
            print(f"\nCrawling: {current_url}")
            response = requests.get(current_url, timeout=10)
            
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, "html.parser")
                
                # Extract title
                title = soup.find("title")
                if title:
                    print(f"Title: {title.text.strip()}")
                
                # Find all links (limit to first 3 for demo)
                links = soup.find_all("a", href=True)[:3]
                print(f"Found {len(links)} links:")
                
                for link in links:
                    href = link.get("href")
                    link_text = link.text.strip()
                    print(f"  - {href}: {link_text}")
                    # Queue absolute links so the crawler actually follows them
                    if href.startswith("http") and href not in visited_urls:
                        urls_to_visit.append(href)
                
                visited_urls.add(current_url)
                
            else:
                print(f"Failed to fetch {current_url}: Status {response.status_code}")
                
        except requests.exceptions.RequestException as e:
            print(f"Error crawling {current_url}: {e}")
        
        # Be respectful - add delay between requests
        time.sleep(1)

# Run the crawler
simple_crawler("https://httpbin.org/html")

Output

Crawling: https://httpbin.org/html
Title: Herman Melville - Moby-Dick
Found 3 links:
  - https://www.gutenberg.org/ebooks/2701: Project Gutenberg
  - https://en.wikipedia.org/wiki/Moby-Dick: Wikipedia
  - https://www.melville.org: Melville Society

Best Practices

When developing web crawlers, it's important to follow these best practices:

  • Respect robots.txt: Check the website's robots.txt file to understand crawling guidelines
  • Add delays: Include delays between requests to avoid overwhelming servers
  • Handle errors: Implement proper error handling for network issues and HTTP errors
  • Set user agents: Identify your crawler with a proper user agent string
  • Limit requests: Set reasonable limits on the number of pages to crawl
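The robots.txt check in particular can be done programmatically with Python's standard urllib.robotparser. A minimal sketch, parsing a made-up robots.txt body instead of fetching a live one (in practice you would read it from the site's /robots.txt URL before crawling):

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration; real crawlers fetch
# this from e.g. https://example.com/robots.txt
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a URL may be crawled by our (made-up) user agent
print(parser.can_fetch("demo-crawler", "https://example.com/index.html"))  # True
print(parser.can_fetch("demo-crawler", "https://example.com/private/x"))   # False

# Honor the site's requested delay between requests
print(parser.crawl_delay("demo-crawler"))  # 2
```

The value returned by crawl_delay() can be passed to time.sleep() between requests, replacing the fixed one-second delay used in the example above.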

Conclusion

Web crawling using Python and the requests library empowers you to extract valuable data from websites efficiently. By combining requests for HTTP operations and BeautifulSoup for HTML parsing, you can build robust crawlers that automate data collection tasks. Always remember to crawl responsibly and respect website terms of service.

Updated on: 2026-03-27T14:13:14+05:30
