Web Crawler Multithreaded - Problem

Given a URL startUrl and an interface HtmlParser, implement a multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all URLs from a webpage
  • Not crawl the same link twice
  • Explore only the links that are under the same hostname as startUrl

The HtmlParser interface is defined as:

interface HtmlParser {
    public List<String> getUrls(String url);
}

Note: getUrls(url) is a blocking call that simulates performing an HTTP request. Single-threaded solutions will exceed the time limit, so you need a multi-threaded solution.
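
As a minimal sketch of one multi-threaded approach (not the only one): keep a thread-safe visited set and submit every blocking getUrls call to a fixed thread pool, draining the resulting Futures as they complete. The pool size of 8 and the getHostname helper are illustrative assumptions, not part of the problem statement.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();  // thread-safe "seen" set
        visited.add(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(8);  // pool size chosen arbitrarily
        Deque<Future<List<String>>> pending = new ArrayDeque<>();
        pending.add(pool.submit(() -> htmlParser.getUrls(startUrl)));

        try {
            while (!pending.isEmpty()) {
                Future<List<String>> fetch = pending.poll();
                List<String> urls;
                try {
                    urls = fetch.get();  // wait for one page; other fetches keep running in parallel
                } catch (Exception e) {
                    continue;  // skip a page whose fetch failed
                }
                for (String url : urls) {
                    // visited.add returns false if the URL was seen before, so each page is fetched once
                    if (getHostname(url).equals(hostname) && visited.add(url)) {
                        pending.add(pool.submit(() -> htmlParser.getUrls(url)));
                    }
                }
            }
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(visited);
    }

    // Hostname = text between "http://" and the next '/', e.g. "news.yahoo.com"
    private String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}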

Input & Output

Example 1 — Basic Website Crawling
Input: startUrl = "http://news.yahoo.com/news/topics/", htmlParser returns {"http://news.yahoo.com/news/topics/": ["http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]}
Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
💡 Note: Starting from the given URL, the crawler finds 2 links with the same hostname and returns all 3 URLs found.

Example 2 — Single Page
Input: startUrl = "http://example.com", htmlParser returns {"http://example.com": ["http://other.com/page"]}
Output: ["http://example.com"]
💡 Note: Only the start URL has the correct hostname; the other link points to a different hostname and is ignored.

Example 3 — Circular References
Input: startUrl = "http://test.com", htmlParser returns {"http://test.com": ["http://test.com/page"], "http://test.com/page": ["http://test.com"]}
Output: ["http://test.com", "http://test.com/page"]
💡 Note: The two pages link to each other, but the visited set prevents infinite crawling.
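
For local experimentation with Example 3, HtmlParser can be backed by a plain map. MapHtmlParser below is a hypothetical test helper (the judge supplies its own implementation) and drives the Solution sketch shown earlier:

import java.util.List;
import java.util.Map;

// Hypothetical in-memory parser for local testing only.
class MapHtmlParser implements HtmlParser {
    private final Map<String, List<String>> links;

    MapHtmlParser(Map<String, List<String>> links) {
        this.links = links;
    }

    public List<String> getUrls(String url) {
        return links.getOrDefault(url, List.of());
    }

    public static void main(String[] args) {
        HtmlParser parser = new MapHtmlParser(Map.of(
                "http://test.com", List.of("http://test.com/page"),
                "http://test.com/page", List.of("http://test.com")));
        // The visited set breaks the cycle; expected output (order may vary):
        // [http://test.com, http://test.com/page]
        System.out.println(new Solution().crawl("http://test.com", parser));
    }
}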

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • Each hostname label is from 1 to 63 characters long

Visualization

[Diagram: thread-pool crawl of "http://news.yahoo.com/news/topics/" with hostname news.yahoo.com; the steps and key insight below are transcribed from it]

Algorithm steps:

  1. Initialize: create the thread pool, the visited set, and a queue seeded with startUrl.
  2. Extract hostname: parse startUrl to get the allowed hostname.
  3. Parallel crawl: each thread calls getUrls() and filters the returned links by hostname.
  4. Synchronize and add: update the visited set under synchronization and push new URLs onto the queue.

Key Insight: Use a thread pool together with concurrent data structures (a ConcurrentHashMap-backed set for visited URLs, a BlockingQueue for pending URLs). The blocking getUrls() call is what gets parallelized across threads; synchronization is needed only when adding to the visited set. Filter URLs by hostname before queueing them, and use a CountDownLatch or a similar mechanism for completion tracking.
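
A sketch of the queue-based design this insight describes, with one substitution: instead of a CountDownLatch, an AtomicInteger of in-flight pages plus a poison-pill value signals the workers to stop. The worker count and helper names are again assumptions made for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class QueueCrawler {
    private static final String DONE = "";  // poison pill: never a real URL

    public List<String> crawl(String startUrl, HtmlParser htmlParser) throws InterruptedException {
        String hostname = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger inFlight = new AtomicInteger(1);  // startUrl is already queued

        visited.add(startUrl);
        queue.put(startUrl);

        int workers = 8;  // arbitrary worker count
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();
                        if (url.equals(DONE)) return;  // shutdown signal
                        for (String next : htmlParser.getUrls(url)) {
                            // enqueue same-host URLs exactly once
                            if (getHostname(next).equals(hostname) && visited.add(next)) {
                                inFlight.incrementAndGet();
                                queue.put(next);
                            }
                        }
                        // last page finished and nothing new was queued: release every worker
                        if (inFlight.decrementAndGet() == 0) {
                            for (int k = 0; k < workers; k++) queue.put(DONE);
                        }
                    }
                } catch (InterruptedException ignored) {
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return new ArrayList<>(visited);
    }

    private String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}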