Given a startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all URLs from a webpage
  • Do not crawl the same link twice
  • Explore only the links that are under the same hostname as startUrl

For example, if startUrl = "http://news.yahoo.com/news/topics/", then the hostname is news.yahoo.com. You should only crawl URLs like http://news.yahoo.com/...

Note: Treat URLs that differ only by a trailing slash "/" as different. For example, "http://news.yahoo.com" and "http://news.yahoo.com/" are different URLs.
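Since the constraints guarantee every URL has the form http://hostname/path with no port, the hostname can be extracted with plain string splitting. A minimal sketch (the helper name `get_hostname` is illustrative, not part of the problem's interface):

```python
def get_hostname(url: str) -> str:
    # Drop the "http://" scheme, then keep everything before the first "/".
    return url.split("://")[1].split("/")[0]
```

For instance, get_hostname("http://news.yahoo.com/news/topics/") returns "news.yahoo.com". Python's urllib.parse.urlparse would work equally well; the manual split just leans on the guaranteed URL format.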

Input & Output

Example 1 — Basic Web Crawling
$ Input: startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com"], edges = [[0,2],[2,1],[3,2],[3,1],[1,3]]
Output: ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/"]
💡 Note: Starting from "http://news.yahoo.com", we can reach all pages with the same hostname "news.yahoo.com". The Google URL is excluded because it has a different hostname.
Example 2 — Single Page
$ Input: startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com"], edges = []
Output: ["http://news.yahoo.com"]
💡 Note: The start URL has no outgoing links, so the result contains just the start page.
Example 3 — Different Hostname Filtering
$ Input: startUrl = "http://example.com/page", urls = ["http://example.com/page", "http://example.com/about", "http://other.com/page"], edges = [[0,1],[0,2]]
Output: ["http://example.com/page", "http://example.com/about"]
💡 Note: From the start page, both links are found, but only example.com/about is included because other.com has a different hostname.
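The bullets above translate directly into an iterative DFS. The sketch below is one possible implementation, not the judge's reference solution; `MockHtmlParser` is a made-up stand-in that simulates the `urls`/`edges` arrays from the examples, since the real judge supplies the `HtmlParser` interface:

```python
class MockHtmlParser:
    """Illustrative stub: builds an adjacency map from the urls/edges test format."""
    def __init__(self, urls, edges):
        self.graph = {u: [] for u in urls}
        for src, dst in edges:
            self.graph[urls[src]].append(urls[dst])

    def getUrls(self, url):
        return self.graph.get(url, [])


def crawl(startUrl, htmlParser):
    # Hostname of startUrl; URLs are guaranteed to look like http://hostname/path.
    hostname = startUrl.split("://")[1].split("/")[0]
    visited = {startUrl}        # never crawl the same link twice
    stack = [startUrl]          # explicit stack instead of recursion
    while stack:
        url = stack.pop()
        for nxt in htmlParser.getUrls(url):
            # Explore only links under the same hostname, skip seen URLs.
            if nxt not in visited and nxt.split("://")[1].split("/")[0] == hostname:
                visited.add(nxt)
                stack.append(nxt)
    return list(visited)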

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • All URLs follow the format http://hostname/path without port

Visualization

[Figure: URL graph for startUrl = http://news.yahoo.com. Same-hostname pages (/news, /news/topics/) are crawled via DFS; news.google.com is marked SKIP. Panels: Input, Algorithm Steps (1. Extract hostname, 2. Initialize visited set and result list, 3. DFS traversal, 4. Filter by host), DFS stack trace, and final result of 3 crawled URLs. Source: TutorialsPoint - Web Crawler | Optimized Depth-First Search.]

Key Insight: DFS explores each URL branch fully before backtracking. The hostname check (comparing each discovered URL's host with startUrl's host) filters out external links, and a visited set prevents infinite loops on circular links. Time: O(N), Space: O(N) for N URLs.
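The four algorithm steps in the figure can also be written as a recursive DFS, where the visited set doubles as the result. This is a sketch under the same assumptions as above (the function name `crawl_recursive` is illustrative, and `htmlParser` is any object exposing `getUrls`):

```python
def crawl_recursive(startUrl: str, htmlParser) -> list:
    # Step 1: extract the hostname from startUrl.
    hostname = startUrl.split("://")[1].split("/")[0]
    visited = set()  # Step 2: visited set, which doubles as the result.

    def dfs(url: str) -> None:
        visited.add(url)
        for nxt in htmlParser.getUrls(url):   # Step 3: recurse into neighbors.
            # Step 4: filter by hostname and skip already-seen URLs.
            if nxt not in visited and nxt.split("://")[1].split("/")[0] == hostname:
                dfs(nxt)

    dfs(startUrl)
    return list(visited)
```

One caveat of the recursive form: with up to 1000 URLs chained in a line, recursion depth stays within Python's default limit, but the iterative version avoids the concern entirely.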