Imagine you're building the core engine for a search engine that needs to systematically explore and index web pages. Your task is to implement a web crawler that starts from a given URL and discovers all pages within the same domain.
Given a starting URL startUrl and an HtmlParser interface, your crawler must:
- Start crawling from startUrl
- Extract links using HtmlParser.getUrls(url) to get all URLs from any webpage
- Stay within domain: only crawl URLs that share the same hostname as startUrl
- Avoid duplicates: never crawl the same URL twice
- Return all discovered URLs in any order
Example: If startUrl is http://news.yahoo.com/news, then http://news.yahoo.com/sports should be crawled, but http://sports.yahoo.com/news should not (different hostname).
Important: URLs with and without trailing slashes are considered different (e.g., http://news.yahoo.com vs http://news.yahoo.com/).
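One way to implement the hostname check is to strip the http:// prefix and take everything up to the first '/'. The helper below is a sketch (getHostname is an illustrative name, not part of the provided interface); it maps both trailing-slash variants to the same hostname, while the full URLs remain distinct strings for de-duplication.

// Sketch: extract the hostname from a URL of the form http://hostname/path.
// Relies on the constraint that all URLs use http:// and contain no port.
private String getHostname(String url) {
    String rest = url.substring("http://".length());
    int slash = rest.indexOf('/');
    return slash == -1 ? rest : rest.substring(0, slash);
}
// getHostname("http://news.yahoo.com")  -> "news.yahoo.com"
// getHostname("http://news.yahoo.com/") -> "news.yahoo.com"
// The two URLs above still count as different pages when de-duplicating.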
The HtmlParser interface is provided:
interface HtmlParser {
    public List<String> getUrls(String url);
}
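Putting the pieces together, a single-threaded BFS with a HashSet of visited URLs satisfies all four requirements. This is a minimal sketch, not the only valid approach (a DFS works equally well); the getHostname helper is repeated here so the class is self-contained.

import java.util.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(startUrl);
        queue.offer(startUrl);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : htmlParser.getUrls(url)) {
                // Enqueue only same-hostname URLs not seen before.
                // Set.add returns false for duplicates, so each URL is crawled once.
                if (getHostname(next).equals(host) && visited.add(next)) {
                    queue.offer(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    // Hostname = everything between "http://" and the next '/'.
    private String getHostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash == -1 ? rest : rest.substring(0, slash);
    }
}

Adding startUrl to visited before the loop guarantees it appears in the output even if no page links back to it.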
Time & Space Complexity
- Time: O(V + E), where V is the number of URLs and E the number of links; each URL is visited once and each link is traversed once
- Space: O(V); the HashSet and the queue each store at most V unique URLs
Constraints
- 1 ≤ urls.length ≤ 1000
- 1 ≤ urls[i].length ≤ 300
- startUrl is one of the urls
- The hostname label contains only lowercase letters, digits, or '-', and cannot start or end with '-'.
- See examples for format of urls and edges.
- All URLs are in http:// format without a port