
Create a Web Scraper with Rate Limiting and Caching
Certification: Advanced Level
Accuracy: 100%
Submissions: 2
Points: 15
Write a Python program that implements a web scraper: it fetches and parses content from websites, applies rate limiting to avoid overloading servers, and caches responses to improve performance. Your task is to implement the WebScraper class with methods for fetching web pages, parsing content, and managing the cache and rate limits.
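The examples below imply only the constructor parameters (rate_limit, cache_expiry) and the fetch_and_extract(url, selector) method; everything else is up to you. A minimal interface sketch, with illustrative (not required) internal attribute names:

```python
class WebScraper:
    def __init__(self, rate_limit=10, cache_expiry=3600):
        self.rate_limit = rate_limit        # maximum requests per minute
        self.cache_expiry = cache_expiry    # seconds a cached page stays valid
        self._cache = {}                    # url -> (fetched_at, html); illustrative
        self._request_times = []            # timestamps of recent requests; illustrative

    def fetch_and_extract(self, url, selector):
        """Fetch url (honouring the rate limit and cache) and return the text
        of every tag named selector, e.g. ["Example Domain"] for "h1"."""
        raise NotImplementedError
```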
Example 1
- Input:
  scraper = WebScraper(rate_limit=10)  # 10 requests per minute
  url = "https://example.com"
  selector = "h1"
  result = scraper.fetch_and_extract(url, selector)
- Output: ["Example Domain"]
- Explanation:
- Step 1: Initialize scraper with rate limit of 10 requests/minute.
- Step 2: Fetch "https://example.com" respecting rate limits.
- Step 3: Extract the text of all H1 tags, yielding "Example Domain" (a tag-extraction sketch follows this example).
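One way to implement the extraction in Step 3 is with the standard-library html.parser module (also mentioned in the hints at the end). The TagTextExtractor name and its behaviour of collecting the text inside each matching tag are illustrative assumptions, not part of the problem statement:

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects the text content of every occurrence of a single tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.results = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._inside = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.results[-1] += data


parser = TagTextExtractor("h1")
parser.feed("<html><body><h1>Example Domain</h1></body></html>")
print(parser.results)  # ['Example Domain']
```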
Example 2
- Input:
  scraper = WebScraper(rate_limit=5, cache_expiry=3600)  # 5 requests per minute, 1-hour cache
  urls = ["https://example.com", "https://example.org", "https://example.com"]  # note the repeated URL
  selector = "title"
  results = [scraper.fetch_and_extract(url, selector) for url in urls]
- Output: [["Example Domain"], ["Example Domain"], ["Example Domain"]]
- Explanation:
- Step 1: Initialize scraper with rate limit and cache settings.
- Step 2: The first two URLs are distinct, so both are fetched over the network (subject to the rate limit).
- Step 3: The third URL is served from the cache instead of triggering a new request (see the cache sketch after this example).
- Step 4: All return the same title "Example Domain".
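The cache hit in Step 3 can be sketched with a plain dictionary keyed by URL, as the hints suggest. The attribute _cache and the helper _fetch_raw below are hypothetical names used only for illustration:

```python
import time


def get_page(scraper, url):
    """Return cached HTML for url if it is still fresh, otherwise refetch it.

    Assumes scraper._cache maps url -> (fetched_at, html) and that
    scraper._fetch_raw(url) performs the actual HTTP request.
    """
    now = time.time()
    entry = scraper._cache.get(url)
    if entry is not None:
        fetched_at, html = entry
        if now - fetched_at < scraper.cache_expiry:
            return html                    # cache hit: no network request
    html = scraper._fetch_raw(url)         # cache miss or expired entry
    scraper._cache[url] = (now, html)
    return html
```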
Constraints
- 1 ≤ rate_limit ≤ 60 (requests per minute; a rate-limiting sketch follows this list)
- 0 ≤ cache_expiry ≤ 86400 (seconds, 24 hours max)
- URLs must be valid and start with http:// or https://
- Selectors must be valid HTML tags (e.g., "h1", "p", "title")
- Time Complexity: O(n) where n is the number of requests
- Space Complexity: O(m) where m is the size of the cache
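One simple way to honour the rate_limit constraint is a sliding-window limiter built on time.sleep. The helper below is a sketch under that assumption (a list of request timestamps and a 60-second window), not the only valid approach:

```python
import time


def wait_for_slot(request_times, rate_limit, window=60.0):
    """Block until another request can be issued without exceeding
    rate_limit requests per window seconds.

    request_times is a list of timestamps of previous requests, oldest first.
    """
    now = time.time()
    # Discard timestamps that have left the sliding window.
    while request_times and now - request_times[0] >= window:
        request_times.pop(0)
    if len(request_times) >= rate_limit:
        # Sleep until the oldest request falls out of the window.
        time.sleep(window - (now - request_times[0]))
    request_times.append(time.time())
```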
Solution Hints
- Use the urllib.request module for HTTP requests.
- Use html.parser.HTMLParser to extract specific HTML tags.
- Implement time-based rate limiting using time.sleep.
- Use a dictionary to store cached pages with timestamps.
- Handle exceptions such as invalid URLs, timeouts, or server errors using try-except blocks (a fetch sketch with error handling follows this list).
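Putting the first and last hints together, a fetch helper built on urllib.request with basic error handling might look like the sketch below. The function name fetch_html, the 10-second timeout, and the choice to return None on failure are illustrative assumptions:

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch a page and return its decoded HTML, or None if the request fails."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must start with http:// or https://")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (urllib.error.URLError, TimeoutError):
        # Covers invalid hosts, HTTP errors (HTTPError is a URLError subclass),
        # and timeouts.
        return None
```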