Imagine you're building the core engine for a search engine that needs to systematically explore and index web pages. Your task is to implement a web crawler that starts from a given URL and discovers every page that shares its hostname.

Given a starting URL startUrl and an HtmlParser interface, your crawler must:

  • Start crawling from startUrl
  • Extract links - call HtmlParser.getUrls(url) to get all URLs found on a page
  • Stay within domain - only crawl URLs that share the same hostname as startUrl
  • Avoid duplicates - never crawl the same URL twice
  • Return all discovered URLs in any order

Example: If startUrl is http://news.yahoo.com/news, then http://news.yahoo.com/sports should be crawled, but http://sports.yahoo.com/news should not, because its hostname (sports.yahoo.com) differs from news.yahoo.com.

Important: URLs with and without trailing slashes are considered different (e.g., http://news.yahoo.com vs http://news.yahoo.com/).
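
Under these rules, the only parsing needed is hostname extraction: every URL starts with http:// and has no port (see the constraints below), so the hostname is everything between the scheme and the first subsequent '/'. A minimal sketch in Java, assuming exactly that URL shape:

// Extracts the hostname from a URL of the form "http://host/optional/path".
// Assumes the "http://" prefix (7 characters) and no port, per the constraints.
static String getHostname(String url) {
    int slash = url.indexOf('/', 7); // first '/' after "http://"
    return slash == -1 ? url.substring(7) : url.substring(7, slash);
}

The path is deliberately left untouched: http://news.yahoo.com and http://news.yahoo.com/ share a hostname but remain distinct URLs.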

The HtmlParser interface is provided:

interface HtmlParser {
    public List<String> getUrls(String url);
}
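
Putting the requirements together, a breadth-first traversal with a HashSet for deduplication covers all of them. The sketch below is one way to write it against the interface above (the Solution/crawl naming follows the usual convention for this problem; the getHostname helper from earlier is repeated so the class is self-contained):

import java.util.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> visited = new HashSet<>();  // dedup set doubles as the answer
        Deque<String> queue = new ArrayDeque<>();
        visited.add(startUrl);
        queue.offer(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : htmlParser.getUrls(url)) {
                // Crawl only same-hostname URLs not seen before;
                // Set.add returns false on duplicates, so one check suffices.
                if (getHostname(next).equals(host) && visited.add(next)) {
                    queue.offer(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    private static String getHostname(String url) {
        int slash = url.indexOf('/', 7); // first '/' after "http://"
        return slash == -1 ? url.substring(7) : url.substring(7, slash);
    }
}

Each URL enters visited at most once and each link is examined once, which is where the O(V + E) bound in the complexity section comes from.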

Input & Output

example_1.py – Basic Website Crawling
$ Input: startUrl = "http://news.yahoo.com/news/topics/"
         urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
         edges = [[0,1],[0,2]]  (edges[i] = [from, to]: page urls[from] links to page urls[to])
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
💡 Note: Starting from the topics page, the crawler discovers the main news page and the business topics page. All three URLs share the hostname news.yahoo.com, so all are included in the result.
example_2.py – Mixed Domains
$ Input: startUrl = "http://news.yahoo.com/news/topics/"
         urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://sports.yahoo.com/news", "http://news.google.com"]
         edges = [[0,1],[0,2],[0,3]]  (page 0 links to pages 1, 2, and 3)
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news"]
💡 Note: Only URLs with hostname news.yahoo.com are crawled. The sports.yahoo.com and news.google.com URLs are filtered out because they have different hostnames.
example_3.py – Single Page Website
$ Input: startUrl = "http://www.example.com/"
         urls = ["http://www.example.com/"]
         edges = []  (the single page has no outbound links)
› Output: ["http://www.example.com/"]
💡 Note: Edge case where the website has only one page and no links to other pages. The crawler returns just the starting URL.
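
The urls/edges encoding used in these examples also makes local testing easy: back the HtmlParser interface with an in-memory adjacency map instead of real HTTP. The MockHtmlParser below is a hypothetical test double, not part of the provided interface; it assumes the Solution class sketched earlier:

import java.util.*;

// Hypothetical in-memory parser built from example-style urls/edges arrays.
class MockHtmlParser implements HtmlParser {
    private final Map<String, List<String>> links = new HashMap<>();

    MockHtmlParser(String[] urls, int[][] edges) {
        for (String u : urls) links.put(u, new ArrayList<>());
        for (int[] e : edges) links.get(urls[e[0]]).add(urls[e[1]]); // [from, to]
    }

    public List<String> getUrls(String url) {
        return links.getOrDefault(url, Collections.emptyList());
    }
}

Constructing it from Example 2's data and calling new Solution().crawl(startUrl, parser) should return just the two news.yahoo.com URLs, in any order.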

Visualization

Web Crawler as Graph Traversal: pages under the news.yahoo.com domain (start, news, topics, sports) form the crawlable graph; external domains (google, cnn, bbc) are excluded.

BFS Algorithm Steps
1. Initialize: queue = [startUrl], visited = {startUrl}
2. While the queue is not empty: currentUrl = queue.dequeue()
3. For each link on currentUrl: check same_domain(link) and link not in visited
4. If both hold, add to both structures: visited.add(link), queue.enqueue(link)
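
Tracing these steps on Example 1, where the topics page links to the news page and the business page:

Initialize: queue = [topics/], visited = {topics/}
Dequeue topics/: news and topics/business are same-host and unseen, so add both to visited and queue
Dequeue news: no outbound links
Dequeue topics/business: no outbound links
Queue empty: return visited, i.e. all three URLs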
Understanding the Visualization
1. Graph Structure: each webpage is a node; each link is a directed edge to another node.
2. Domain Boundary: only traverse edges that lead to nodes with the same hostname as startUrl.
3. Visited Tracking: use a HashSet to mark visited nodes and avoid cycles.
4. BFS Traversal: use a queue to explore nodes level by level, systematically.
Key Takeaway
🎯 Key Insight: Web crawling is essentially a graph traversal problem with a domain-boundary constraint. BFS plus a HashSet yields optimal O(V + E) time with systematic, level-by-level exploration.

Time & Space Complexity

⏱️ Time Complexity: O(V + E)
Each URL is visited once (V) and each link is traversed once (E), so the running time grows linearly with the size of the link graph.

Space Complexity: O(V)
The HashSet and the queue together store at most V unique URLs, so space is linear as well.

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • A hostname label contains only lowercase letters, digits, or '-', and cannot start or end with '-'.
  • See the examples for the format of urls and edges.
  • All URLs use the http:// scheme and contain no port.