Create a Web Scraper with Rate Limiting and Caching

Certification: Advanced Level | Accuracy: 100% | Submissions: 2 | Points: 15

Write a Python program that implements a web scraper which fetches and parses content from websites, with rate limiting to avoid overloading servers and caching to improve performance. Your task is to implement the WebScraper class with methods for fetching web pages, parsing content, and managing the cache and rate limits.

Example 1
  • Input:
      scraper = WebScraper(rate_limit=10)  # 10 requests per minute
      url = "https://example.com"
      selector = "h1"
      result = scraper.fetch_and_extract(url, selector)
  • Output: ["Example Domain"]
  • Explanation:
    • Step 1: Initialize scraper with rate limit of 10 requests/minute.
    • Step 2: Fetch "https://example.com" respecting rate limits.
    • Step 3: Extract all H1 tags (finds "Example Domain").
Example 2
  • Input:
      scraper = WebScraper(rate_limit=5, cache_expiry=3600)  # 5 requests per minute, 1-hour cache
      urls = ["https://example.com", "https://example.org", "https://example.com"]  # note the repeated URL
      selector = "title"
      results = [scraper.fetch_and_extract(url, selector) for url in urls]
  • Output: [["Example Domain"], ["Example Domain"], ["Example Domain"]]
  • Explanation:
    • Step 1: Initialize scraper with rate limit and cache settings.
    • Step 2: The first two URLs are distinct, so both are fetched over the network (subject to the rate limit).
    • Step 3: Third URL uses cached content instead of new request.
    • Step 4: All return the same title "Example Domain".
Constraints
  • 1 ≤ rate_limit ≤ 60 (requests per minute)
  • 0 ≤ cache_expiry ≤ 86400 (seconds, 24 hours max)
  • URLs must be valid and start with http:// or https://
  • Selectors must be valid HTML tags (e.g., "h1", "p", "title")
  • Time Complexity: O(n) where n is the number of requests
  • Space Complexity: O(m) where m is the size of the cache
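
From the calls in the two examples, the expected interface looks roughly like the skeleton below. Only the constructor arguments and fetch_and_extract appear in the problem; the attribute names and the cache_expiry default are assumptions.

    class WebScraper:
        """Skeleton inferred from the examples; internals are left to the solver."""

        def __init__(self, rate_limit, cache_expiry=3600):
            self.rate_limit = rate_limit        # max requests per minute
            self.cache_expiry = cache_expiry    # cache lifetime in seconds
            self.cache = {}                     # url -> (fetched_at, html)
            self.last_request_time = 0.0        # timestamp of the previous fetch

        def fetch_and_extract(self, url, selector):
            """Fetch url (cached, rate limited) and return the text of every
            element whose tag name matches selector."""
            raise NotImplementedError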
Tags: Dictionaries, Algorithms, Microsoft, IBM
Solution Hints


  • Use the urllib.request module for HTTP requests
  • Use html.parser.HTMLParser to extract specific HTML tags (a minimal extractor is sketched after this list)
  • Implement time-based rate limiting using time.sleep (see the interval sketch after this list)
  • Use a dictionary to store cached pages with timestamps (see the cache sketch after this list)
  • Handle exceptions such as invalid URLs, timeouts, or server errors using try-except blocks (a fetch helper is sketched after the solution steps)

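One way to act on the HTMLParser hint is a subclass that records the text inside every occurrence of a single tag. The class name TagTextExtractor is illustrative, not part of the required API.

    from html.parser import HTMLParser

    class TagTextExtractor(HTMLParser):
        """Collect the text content of every occurrence of one tag."""

        def __init__(self, tag):
            super().__init__()
            self.tag = tag.lower()
            self.results = []
            self._depth = 0                 # > 0 while inside a matching tag

        def handle_starttag(self, tag, attrs):
            if tag == self.tag:
                self._depth += 1
                self.results.append("")     # start collecting a new match

        def handle_endtag(self, tag):
            if tag == self.tag and self._depth > 0:
                self._depth -= 1

        def handle_data(self, data):
            if self._depth > 0:
                self.results[-1] += data    # accumulate text of the open match

    parser = TagTextExtractor("h1")
    parser.feed("<html><body><h1>Example Domain</h1></body></html>")
    print(parser.results)                   # ['Example Domain']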

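For the time.sleep hint, a common approach is to enforce a minimum interval of 60 / rate_limit seconds between consecutive requests. The helper below is a sketch under that assumption; its name and return convention are illustrative.

    import time

    def wait_for_rate_limit(last_request_time, rate_limit):
        """Sleep until at least 60 / rate_limit seconds have passed since
        last_request_time, then return the timestamp of the new request."""
        min_interval = 60.0 / rate_limit            # seconds between requests
        elapsed = time.time() - last_request_time
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)      # pause until the next slot
        return time.time()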

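For the dictionary-cache hint, each entry can pair the page body with the time it was fetched, so a lookup can treat stale entries as misses. The module-level dict and helper names here are for illustration only; in the real class they would be instance attributes and methods.

    import time

    cache = {}  # url -> (fetched_at, html)

    def cache_get(url, cache_expiry):
        """Return cached HTML for url, or None if absent or expired."""
        entry = cache.get(url)
        if entry is None:
            return None
        fetched_at, html = entry
        if time.time() - fetched_at > cache_expiry:
            del cache[url]                  # drop stale entries eagerly
            return None
        return html

    def cache_put(url, html):
        cache[url] = (time.time(), html)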

Steps to solve using this approach:

 Step 1: Initialize the WebScraper with rate limit parameters and cache structure
 Step 2: Implement rate limiting function to enforce request intervals
 Step 3: Create a caching mechanism with expiration times
 Step 4: Implement URL fetching using urllib and handle errors (see the fetch sketch after these steps)
 Step 5: Parse the fetched content using HTMLParser and extract elements matching the tag
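
As a sketch of Step 4 (and the urllib and try-except hints above), assuming failures should surface as None rather than as exceptions:

    import urllib.error
    import urllib.request

    def fetch(url, timeout=10):
        """Return the decoded body of url, or None if the request fails."""
        if not url.startswith(("http://", "https://")):
            raise ValueError("URL must start with http:// or https://")
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, TimeoutError) as exc:
            # URLError also covers HTTPError (server errors, bad status codes)
            print(f"Failed to fetch {url}: {exc}")
            return None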
