Web Crawler Multithreaded - Problem

Imagine you're building a web crawler that needs to explore all pages within a specific website domain as quickly as possible. You have a starting URL and need to discover all linked pages that belong to the same hostname, but there's a catch - single-threaded crawling is too slow!

Given a startUrl and an HtmlParser interface, implement a multi-threaded web crawler that:

  • 🌐 Starts crawling from startUrl
  • 📄 Uses HtmlParser.getUrls(url) to extract all URLs from each page
  • 🚫 Never crawls the same URL twice (avoid infinite loops)
  • 🏠 Only explores URLs with the same hostname as the starting URL
  • ⚡ Utilizes multiple threads for concurrent crawling

Hostname Rules: URLs http://leetcode.com/problems and http://leetcode.com/contest share the same hostname (leetcode.com), but http://example.org/test and http://example.com/abc have different hostnames.
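Since every URL in this problem has the form http://hostname/path, the hostname can be isolated with plain string slicing rather than a full URL parser. Below is a minimal sketch illustrating the rule with the examples above; the class and helper names (HostnameDemo, hostnameOf) are illustrative and not part of the provided code.

class HostnameDemo {
    // Assumes the http://hostname/path format guaranteed by the constraints.
    static String hostnameOf(String url) {
        int start = "http://".length();
        int slash = url.indexOf('/', start);
        return slash == -1 ? url.substring(start) : url.substring(start, slash);
    }

    public static void main(String[] args) {
        // Same hostname: both lines print leetcode.com
        System.out.println(hostnameOf("http://leetcode.com/problems"));
        System.out.println(hostnameOf("http://leetcode.com/contest"));

        // Different hostnames: example.org vs. example.com
        System.out.println(hostnameOf("http://example.org/test"));
        System.out.println(hostnameOf("http://example.com/abc"));
    }
}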

The HtmlParser interface is:

interface HtmlParser {
    // Returns all URLs found on the given webpage
    // This is a blocking HTTP request (takes ~15ms max)
    public List<String> getUrls(String url);
}

Challenge: Single-threaded solutions will exceed the time limit. Can your multi-threaded approach crawl faster by processing multiple pages simultaneously?
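One possible multi-threaded approach, sketched below, runs a level-by-level BFS: each round submits every newly discovered same-hostname URL to a thread pool so the blocking getUrls calls overlap. This is a minimal sketch assuming the HtmlParser interface above; the thread-pool size and helper names are illustrative choices, not part of the problem.

import java.util.*;
import java.util.concurrent.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = hostnameOf(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe visited set
        visited.add(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            // BFS by levels: every fetch in the current level runs concurrently.
            List<Future<List<String>>> frontier = new ArrayList<>();
            frontier.add(pool.submit(() -> htmlParser.getUrls(startUrl)));

            while (!frontier.isEmpty()) {
                List<Future<List<String>>> next = new ArrayList<>();
                for (Future<List<String>> task : frontier) {
                    for (String url : getQuietly(task)) {
                        // Follow only same-hostname links, claiming each URL exactly once.
                        if (hostname.equals(hostnameOf(url)) && visited.add(url)) {
                            next.add(pool.submit(() -> htmlParser.getUrls(url)));
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(visited);
    }

    // Unwraps a future, treating interruption or a failed fetch as "no links".
    private static List<String> getQuietly(Future<List<String>> f) {
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.emptyList();
        } catch (ExecutionException e) {
            return Collections.emptyList();
        }
    }

    private static String hostnameOf(String url) {
        int start = "http://".length();
        int slash = url.indexOf('/', start);
        return slash == -1 ? url.substring(start) : url.substring(start, slash);
    }
}

The level-synchronous structure keeps coordination simple: the worker threads only execute the blocking getUrls calls, while the main thread alone updates the visited set and schedules the next level, so no extra locking is needed.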

Input & Output

example_1.py - Basic Tree Structure
$ Input: startUrl = "http://news.yahoo.com/news/topics/" urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2"] edges = [[0,1],[0,2]] (URLs 0->1, 0->2 means URL 0 links to URLs 1 and 2)
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2"]
💡 Note: Starting from the root URL, we discover and crawl all linked pages within the same hostname (news.yahoo.com). The multi-threaded approach processes multiple URLs concurrently.
example_2.py - Complex Network
$ Input: startUrl = "http://news.yahoo.com/news/topics/" urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2", "http://news.google.com"] edges = [[0,1],[0,2],[1,3],[2,3]] (Cross-links between pages, including external domain)
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news/topics/1", "http://news.yahoo.com/news/topics/2"]
💡 Note: Even though URLs 1 and 2 link to http://news.google.com, we only crawl URLs with the same hostname (news.yahoo.com). External links are filtered out.
example_3.py - Circular References
$ Input: startUrl = "http://example.com/page1" urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"] edges = [[0,1],[1,2],[2,0]] (Circular reference: page1->page2->page3->page1)
› Output: ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
💡 Note: Despite circular references, each URL is crawled exactly once thanks to the visited set (demonstrated below). Multi-threading safely handles concurrent access to shared data structures.
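The crawl-once behavior in this example relies on a set whose add operation is atomic. Here is a minimal demo using a ConcurrentHashMap-backed set; the class name VisitedDemo is illustrative.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class VisitedDemo {
    public static void main(String[] args) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        String url = "http://example.com/page1";

        // add() is atomic and returns true only for the first caller, so even
        // with cycles (page1 -> page2 -> page3 -> page1) a URL is scheduled
        // for crawling at most once across all threads.
        System.out.println(visited.add(url)); // true  -> crawl this URL
        System.out.println(visited.add(url)); // false -> already claimed, skip
    }
}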

Visualization

[Diagram] A Thread Pool Manager feeds a shared URL Queue consumed by Workers 1-4, which call the HTML Parser and synchronize on a shared Visited Set and Results list. 🚀 Concurrent Web Crawling: multiple threads process URLs simultaneously with safe coordination. ⚡ Significant speed improvement!
Understanding the Visualization

  1. Initialize Resources: Create the shared queue and visited set, and launch the worker threads.
  2. Parallel Processing: Each thread takes URLs from the queue and processes them concurrently.
  3. Coordinate Updates: Threads safely add discovered URLs to the queue and update shared state.
  4. Detect Completion: Workers coordinate to detect when all URLs have been processed (sketched below).

Key Takeaway
🎯 Key Insight: Multi-threading dramatically improves crawling performance by processing multiple URLs concurrently while using proper synchronization to prevent race conditions and ensure correctness.
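The queue-and-workers flow in steps 1-4 can be sketched roughly as follows. This is one possible design rather than the required one: the worker count, the poll timeout, and the atomic pending counter used for completion detection are illustrative choices, and the sketch assumes the HtmlParser interface defined above.

import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class QueueCrawler {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) throws InterruptedException {
        String host = hostnameOf(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger pending = new AtomicInteger(0); // URLs enqueued but not yet fully processed
        CountDownLatch done = new CountDownLatch(1);

        visited.add(startUrl);
        pending.incrementAndGet();
        queue.add(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        Runnable worker = () -> {
            try {
                while (done.getCount() > 0) {
                    // Poll with a timeout so idle workers can notice shutdown.
                    String url = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (url == null) continue;
                    for (String next : htmlParser.getUrls(url)) {
                        if (host.equals(hostnameOf(next)) && visited.add(next)) {
                            pending.incrementAndGet();
                            queue.add(next);
                        }
                    }
                    // This URL is finished; if nothing else is pending, crawling is complete.
                    if (pending.decrementAndGet() == 0) {
                        done.countDown();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        for (int i = 0; i < 4; i++) pool.submit(worker);
        done.await();       // completion detected when the pending counter hits zero
        pool.shutdownNow();
        return new ArrayList<>(visited);
    }

    private static String hostnameOf(String url) {
        int start = "http://".length();
        int slash = url.indexOf('/', start);
        return slash == -1 ? url.substring(start) : url.substring(start, slash);
    }
}

The counter guarantees termination: a worker can only enqueue new URLs while processing one it has already claimed, so once pending reaches zero no further work can appear and all threads can shut down.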

Time & Space Complexity

⏱️ Time Complexity: O((V + E) / T)
V URLs and E links divided by T threads, assuming good load distribution and no bottlenecks. ✓ Linear growth in the size of the link graph.

Space Complexity: O(V + T)
Space for the visited set (V URLs) plus thread stacks and synchronization structures (T threads). ✓ Linear space.

Constraints

  • 1 โ‰ค urls.length โ‰ค 1000
  • 1 โ‰ค urls[i].length โ‰ค 300
  • startUrl is one of the urls
  • All URLs follow the format: http://hostname/path
  • HtmlParser.getUrls(url) returns URLs within 15ms
  • Single-threaded solutions will exceed the time limit
Asked in: Google (45), Amazon (38), Meta (32), Microsoft (28)