Imagine you're building a web crawler that needs to explore all pages within a specific website domain as quickly as possible. You have a starting URL and need to discover all linked pages that belong to the same hostname, but there's a catch: single-threaded crawling is too slow!
Given a startUrl and an HtmlParser interface, implement a multi-threaded web crawler that:
- Starts crawling from startUrl
- Uses HtmlParser.getUrls(url) to extract all URLs from each page
- Never crawls the same URL twice (avoid infinite loops)
- Only explores URLs with the same hostname as the starting URL
- Utilizes multiple threads for concurrent crawling
Hostname Rules: URLs http://leetcode.com/problems and http://leetcode.com/contest share the same hostname (leetcode.com), but http://example.org/test and http://example.com/abc have different hostnames.
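Since every URL in this problem has the form http://hostname/path, hostname comparison reduces to string slicing. A minimal sketch (the helper name getHostname is my own, not part of the problem's API):

private static String getHostname(String url) {
    // Extract the hostname from a URL of the form http://hostname/path.
    int start = url.indexOf("//") + 2;   // skip the "http://" prefix
    int end = url.indexOf('/', start);   // first '/' after the hostname, if any
    return end == -1 ? url.substring(start) : url.substring(start, end);
}

For example, getHostname("http://leetcode.com/problems") and getHostname("http://leetcode.com/contest") both return "leetcode.com".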
The HtmlParser interface is:
interface HtmlParser {
    // Returns all URLs found on the given webpage
    // This is a blocking HTTP request (takes ~15 ms max)
    public List<String> getUrls(String url);
}

Challenge: Single-threaded solutions will exceed the time limit. Can your multi-threaded approach crawl faster by processing multiple pages simultaneously?
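One workable design, sketched below under stated assumptions, combines a fixed thread pool with a concurrent visited set: each fetch runs as a CompletableFuture task on the pool, and a discovered URL is claimed atomically via Set.add before a task for it is scheduled, so no URL is ever crawled twice. The pool size of 8 and the names Solution, crawlAsync, and getHostname are my own choices, not given by the problem:

import java.util.*;
import java.util.concurrent.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();  // thread-safe visited set
        visited.add(startUrl);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            crawlAsync(startUrl, hostname, htmlParser, visited, pool).join();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(visited);
    }

    private CompletableFuture<Void> crawlAsync(String url, String hostname,
            HtmlParser parser, Set<String> visited, ExecutorService pool) {
        // Run the blocking getUrls call on the pool, then fan out to children.
        return CompletableFuture.supplyAsync(() -> parser.getUrls(url), pool)
            .thenCompose(urls -> {
                List<CompletableFuture<Void>> children = new ArrayList<>();
                for (String next : urls) {
                    // visited.add returns false if the URL was already claimed,
                    // so each URL is scheduled at most once.
                    if (getHostname(next).equals(hostname) && visited.add(next)) {
                        children.add(crawlAsync(next, hostname, parser, visited, pool));
                    }
                }
                return CompletableFuture.allOf(children.toArray(new CompletableFuture[0]));
            });
    }

    // Same string-slicing helper as sketched above.
    private static String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}

The concurrency win comes from overlapping the blocking ~15 ms fetches: while one thread waits on getUrls, the others keep fetching, which is exactly what a single-threaded BFS cannot do. A BlockingQueue of pending URLs drained by worker threads is an equally valid alternative design.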
Time & Space Complexity
- Time: O((V + E) / T), where V is the number of URLs and E the number of links, divided across T threads, assuming good load distribution and no bottlenecks
- Space: O(V + T) for the visited set (V URLs) plus per-thread stacks and synchronization structures (T threads)
Constraints
- 1 ≤ urls.length ≤ 1000
- 1 ≤ urls[i].length ≤ 300
- startUrl is one of the urls
- All URLs follow the format: http://hostname/path
- HtmlParser.getUrls(url) returns URLs within 15ms
- Single-threaded solutions will exceed the time limit
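For local experimentation, the judge-provided HtmlParser can be stubbed with an in-memory link map; the sleep simulates the blocking ~15 ms fetch. The class name MockHtmlParser and the sample link graph below are illustrative, not part of the problem:

import java.util.*;

class MockHtmlParser implements HtmlParser {
    // Hypothetical link graph for testing; not taken from the problem.
    private final Map<String, List<String>> links = Map.of(
        "http://leetcode.com", List.of("http://leetcode.com/problems", "http://example.org/test"),
        "http://leetcode.com/problems", List.of("http://leetcode.com/contest"),
        "http://leetcode.com/contest", List.of("http://leetcode.com")
    );

    public List<String> getUrls(String url) {
        try {
            Thread.sleep(15);  // simulate the blocking HTTP request
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return links.getOrDefault(url, List.of());
    }
}

Crawling from http://leetcode.com with this stub should return the three leetcode.com URLs and skip http://example.org/test, whose hostname differs.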