Imagine you're building the core engine for a search engine that needs to systematically explore and index web pages. Your task is to implement a web crawler that starts from a given URL and discovers every page that shares its hostname.

Given a starting URL startUrl and an HtmlParser interface, your crawler must:

  • Start crawling from startUrl
  • Extract links - call HtmlParser.getUrls(url) to get all URLs found on a page
  • Stay within domain - only crawl URLs that share the same hostname as startUrl
  • Avoid duplicates - never crawl the same URL twice
  • Return all discovered URLs in any order

Example: If startUrl is http://news.yahoo.com/news, then http://news.yahoo.com/sports should be crawled, but http://sports.yahoo.com/news should not, because its hostname (sports.yahoo.com) differs from news.yahoo.com.

Important: URLs with and without trailing slashes are considered different (e.g., http://news.yahoo.com vs http://news.yahoo.com/).
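
Under these rules, the only parsing needed is hostname extraction: every URL starts with http:// and has no port (see the constraints below), so the hostname is everything between the scheme and the first subsequent '/'. A minimal sketch in Java, assuming exactly that URL shape:

// Extracts the hostname from a URL of the form "http://host/optional/path".
// Assumes the "http://" prefix (7 characters) and no port, per the constraints.
static String getHostname(String url) {
    int slash = url.indexOf('/', 7); // first '/' after "http://"
    return slash == -1 ? url.substring(7) : url.substring(7, slash);
}

The path is deliberately left untouched: http://news.yahoo.com and http://news.yahoo.com/ share a hostname but remain distinct URLs.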

The HtmlParser interface is provided:

interface HtmlParser {
    public List<String> getUrls(String url);
}
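
Putting the requirements together, a breadth-first traversal with a HashSet for deduplication covers all of them. The sketch below is one way to write it against the interface above (the Solution/crawl naming follows the usual convention for this problem; the getHostname helper from earlier is repeated so the class is self-contained):

import java.util.*;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String host = getHostname(startUrl);
        Set<String> visited = new HashSet<>();  // dedup set doubles as the answer
        Deque<String> queue = new ArrayDeque<>();
        visited.add(startUrl);
        queue.offer(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String next : htmlParser.getUrls(url)) {
                // Crawl only same-hostname URLs not seen before;
                // Set.add returns false on duplicates, so one check suffices.
                if (getHostname(next).equals(host) && visited.add(next)) {
                    queue.offer(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    private static String getHostname(String url) {
        int slash = url.indexOf('/', 7); // first '/' after "http://"
        return slash == -1 ? url.substring(7) : url.substring(7, slash);
    }
}

Each URL enters visited at most once and each link is examined once, which is where the O(V + E) bound in the complexity section comes from.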

Input & Output

example_1.py – Basic Website Crawling
$ Input: startUrl = "http://news.yahoo.com/news/topics/"
         urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
         edges = [[0,1],[0,2]]  (edges[i] = [from, to]: page urls[from] links to page urls[to])
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
💡 Note: Starting from the topics page, the crawler discovers the main news page and the business topics page. All three URLs share the hostname news.yahoo.com, so all are included in the result.
example_2.py – Mixed Domains
$ Input: startUrl = "http://news.yahoo.com/news/topics/"
         urls = ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://sports.yahoo.com/news", "http://news.google.com"]
         edges = [[0,1],[0,2],[0,3]]  (page 0 links to pages 1, 2, and 3)
› Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news"]
💡 Note: Only URLs with hostname news.yahoo.com are crawled. The sports.yahoo.com and news.google.com URLs are filtered out because they have different hostnames.
example_3.py – Single Page Website
$ Input: startUrl = "http://www.example.com/"
         urls = ["http://www.example.com/"]
         edges = []  (the single page has no outbound links)
› Output: ["http://www.example.com/"]
💡 Note: Edge case where the website has only one page and no links to other pages. The crawler returns just the starting URL.
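
The urls/edges encoding used in these examples also makes local testing easy: back the HtmlParser interface with an in-memory adjacency map instead of real HTTP. The MockHtmlParser below is a hypothetical test double, not part of the provided interface; it assumes the Solution class sketched earlier:

import java.util.*;

// Hypothetical in-memory parser built from example-style urls/edges arrays.
class MockHtmlParser implements HtmlParser {
    private final Map<String, List<String>> links = new HashMap<>();

    MockHtmlParser(String[] urls, int[][] edges) {
        for (String u : urls) links.put(u, new ArrayList<>());
        for (int[] e : edges) links.get(urls[e[0]]).add(urls[e[1]]); // [from, to]
    }

    public List<String> getUrls(String url) {
        return links.getOrDefault(url, Collections.emptyList());
    }
}

Constructing it from Example 2's data and calling new Solution().crawl(startUrl, parser) should return just the two news.yahoo.com URLs, in any order.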

Visualization

Web Crawler as Graph Traversal: pages under the news.yahoo.com domain (start, news, topics, sports) form the crawlable graph; external domains (google, cnn, bbc) are excluded.

BFS Algorithm Steps
1. Initialize: queue = [startUrl], visited = {startUrl}
2. While the queue is not empty: currentUrl = queue.dequeue()
3. For each link on currentUrl: check same_domain(link) and link not in visited
4. If both hold, add to both structures: visited.add(link), queue.enqueue(link)
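
Tracing these steps on Example 1, where the topics page links to the news page and the business page:

Initialize: queue = [topics/], visited = {topics/}
Dequeue topics/: news and topics/business are same-host and unseen, so add both to visited and queue
Dequeue news: no outbound links
Dequeue topics/business: no outbound links
Queue empty: return visited, i.e. all three URLs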
Understanding the Visualization
1. Graph Structure: each webpage is a node; each link is a directed edge to another node.
2. Domain Boundary: only traverse edges that lead to nodes with the same hostname as startUrl.
3. Visited Tracking: use a HashSet to mark visited nodes and avoid cycles.
4. BFS Traversal: use a queue to explore nodes level by level, systematically.
Key Takeaway
🎯 Key Insight: Web crawling is essentially a graph traversal problem with a domain-boundary constraint. BFS plus a HashSet yields optimal O(V + E) time with systematic, level-by-level exploration.

Time & Space Complexity

⏱️ Time Complexity: O(V + E)
Each URL is visited once (V) and each link is traversed once (E), so the running time grows linearly with the size of the link graph.

Space Complexity: O(V)
The HashSet and the queue together store at most V unique URLs, so space is linear as well.

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • A hostname label contains only lowercase letters, digits, or '-', and cannot start or end with '-'.
  • See the examples for the format of urls and edges.
  • All URLs use the http:// scheme and contain no port.