
Create a Web Scraper with Rate Limiting and Caching
Certification: Advanced Level
Accuracy: 100%
Submissions: 2
Points: 15
Write a Python program that implements a web scraper which fetches and parses content from websites, with rate limiting to avoid overloading servers and caching to improve performance. Your task is to implement the WebScraper class with methods for fetching web pages, parsing content, and managing the cache and rate limits.
Example 1
- Input:
      scraper = WebScraper(rate_limit=10)  # 10 requests per minute
      url = "https://example.com"
      selector = "h1"
      result = scraper.fetch_and_extract(url, selector)
- Output: ["Example Domain"]
- Explanation:
    - Step 1: Initialize the scraper with a rate limit of 10 requests/minute.
    - Step 2: Fetch "https://example.com", respecting the rate limit (see the timing sketch after this example).
    - Step 3: Extract all h1 tags (finds "Example Domain").
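One way to satisfy the rate limit in Step 2 is to space live requests at least 60 / rate_limit seconds apart (6 seconds here). Below is a minimal sketch of that timing logic using time.sleep; the RateLimiter name and the single-threaded assumption are illustrative, not part of the required interface.

```python
import time

class RateLimiter:
    """Allows at most `rate_limit` requests per minute by spacing calls out."""

    def __init__(self, rate_limit):
        self.min_interval = 60.0 / rate_limit  # e.g. rate_limit=10 -> 6.0 s
        self.last_request = float("-inf")      # no request made yet

    def wait(self):
        # Sleep just long enough that consecutive requests are at least
        # `min_interval` seconds apart; the first call never sleeps.
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
```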
 
Example 2
- Input:
      scraper = WebScraper(rate_limit=5, cache_expiry=3600)  # 5 requests per minute, 1-hour cache
      urls = ["https://example.com", "https://example.org", "https://example.com"]  # note the repeated URL
      selector = "title"
      results = [scraper.fetch_and_extract(url, selector) for url in urls]
- Output: [["Example Domain"], ["Example Domain"], ["Example Domain"]]
- Explanation:
    - Step 1: Initialize the scraper with rate-limit and cache settings.
    - Step 2: The first two unique URLs are fetched (rate limited).
    - Step 3: The third URL is served from the cache instead of a new request (see the expiry sketch after this example).
    - Step 4: All three calls return the same title, "Example Domain".
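Step 3 hinges on cache expiry: a page is reused only while it is younger than cache_expiry seconds. A minimal sketch of that check, assuming entries are stored as (fetch_time, content) pairs keyed by URL; the PageCache name and the evict-on-read behavior are illustrative choices.

```python
import time

class PageCache:
    """Dictionary-backed cache mapping URL -> (fetch_time, content)."""

    def __init__(self, cache_expiry):
        self.cache_expiry = cache_expiry  # seconds before an entry goes stale
        self._store = {}

    def get(self, url):
        """Return cached content if present and still fresh, else None."""
        entry = self._store.get(url)
        if entry is None:
            return None
        fetched_at, content = entry
        if time.monotonic() - fetched_at > self.cache_expiry:
            del self._store[url]  # stale: evict so the caller refetches
            return None
        return content

    def put(self, url, content):
        self._store[url] = (time.monotonic(), content)
```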
 
Constraints
- 1 ≤ rate_limit ≤ 60 (requests per minute)
- 0 ≤ cache_expiry ≤ 86400 (seconds, 24 hours max)
- URLs must be valid and start with http:// or https://
- Selectors must be valid HTML tags (e.g., "h1", "p", "title")
- Time Complexity: O(n) where n is the number of requests
- Space Complexity: O(m) where m is the size of the cache
Solution Hints
- Use the urllib.request module for HTTP requests
- Subclass html.parser.HTMLParser to extract specific HTML tags
- Implement time-based rate limiting using time.sleep
- Use a dictionary to store cached pages with timestamps
- Handle exceptions such as invalid URLs, timeouts, or server errors using try-except blocks; the sketch below combines all of these hints
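Putting the hints together, here is one possible shape for the full class. This is a sketch under stated assumptions, not a reference solution: selectors are bare tag names (per the constraints), only the standard library is used, and the _TagExtractor helper and the 10-second timeout are illustrative choices.

```python
import time
import urllib.error
import urllib.request
from html.parser import HTMLParser


class _TagExtractor(HTMLParser):
    """Collects the text inside every occurrence of one HTML tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.results = []
        self._in_tag = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._in_tag = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._in_tag = False

    def handle_data(self, data):
        if self._in_tag and data.strip():
            self.results.append(data.strip())


class WebScraper:
    def __init__(self, rate_limit=10, cache_expiry=3600):
        self.min_interval = 60.0 / rate_limit  # seconds between live requests
        self.cache_expiry = cache_expiry
        self._cache = {}                       # url -> (fetch_time, html)
        self._last_request = float("-inf")

    def _fetch(self, url):
        """Return the page HTML, serving from the cache when possible."""
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"invalid URL: {url}")
        entry = self._cache.get(url)
        if entry and time.monotonic() - entry[0] <= self.cache_expiry:
            return entry[1]  # cache hit: no network call, no rate-limit wait
        # Cache miss: wait out the rate limit, then fetch over the network.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, TimeoutError) as exc:
            raise RuntimeError(f"failed to fetch {url}") from exc
        self._last_request = time.monotonic()
        self._cache[url] = (self._last_request, html)
        return html

    def fetch_and_extract(self, url, selector):
        """Fetch `url` (cached or live) and return the text of all `selector` tags."""
        parser = _TagExtractor(selector)
        parser.feed(self._fetch(url))
        return parser.results
```

Usage matches the examples: WebScraper(rate_limit=10).fetch_and_extract("https://example.com", "h1") should return ["Example Domain"].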
