Web Crawler Multithreaded - Problem

Given a URL startUrl and an interface HtmlParser, implement a multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all URLs from a webpage
  • Not crawl the same link twice
  • Explore only the links that are under the same hostname as startUrl

The HtmlParser interface is defined as:

interface HtmlParser {
    public List<String> getUrls(String url);
}

Note: getUrls(url) is a blocking call that simulates performing an HTTP request. Single-threaded solutions will exceed the time limit, so you need a multi-threaded solution.
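
As a minimal sketch of one multi-threaded approach (not the only one): keep a thread-safe visited set and submit every blocking getUrls call to a fixed thread pool, draining the resulting Futures as they complete. The pool size of 8 and the getHostname helper are illustrative assumptions, not part of the problem statement.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();  // thread-safe "seen" set
        visited.add(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(8);  // pool size chosen arbitrarily
        Deque<Future<List<String>>> pending = new ArrayDeque<>();
        pending.add(pool.submit(() -> htmlParser.getUrls(startUrl)));

        try {
            while (!pending.isEmpty()) {
                Future<List<String>> fetch = pending.poll();
                List<String> urls;
                try {
                    urls = fetch.get();  // wait for one page; other fetches keep running in parallel
                } catch (Exception e) {
                    continue;  // skip a page whose fetch failed
                }
                for (String url : urls) {
                    // visited.add returns false if the URL was seen before, so each page is fetched once
                    if (getHostname(url).equals(hostname) && visited.add(url)) {
                        pending.add(pool.submit(() -> htmlParser.getUrls(url)));
                    }
                }
            }
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(visited);
    }

    // Hostname = text between "http://" and the next '/', e.g. "news.yahoo.com"
    private String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}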

Input & Output

Example 1 — Basic Website Crawling
Input: startUrl = "http://news.yahoo.com/news/topics/", htmlParser returns {"http://news.yahoo.com/news/topics/": ["http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]}
Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
💡 Note: Starting from the given URL, the crawler finds 2 links with the same hostname and returns all 3 URLs found.

Example 2 — Single Page
Input: startUrl = "http://example.com", htmlParser returns {"http://example.com": ["http://other.com/page"]}
Output: ["http://example.com"]
💡 Note: Only the start URL has the correct hostname; the other link points to a different hostname and is ignored.

Example 3 — Circular References
Input: startUrl = "http://test.com", htmlParser returns {"http://test.com": ["http://test.com/page"], "http://test.com/page": ["http://test.com"]}
Output: ["http://test.com", "http://test.com/page"]
💡 Note: The two pages link to each other, but the visited set prevents infinite crawling.
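
For local experimentation with Example 3, HtmlParser can be backed by a plain map. MapHtmlParser below is a hypothetical test helper (the judge supplies its own implementation) and drives the Solution sketch shown earlier:

import java.util.List;
import java.util.Map;

// Hypothetical in-memory parser for local testing only.
class MapHtmlParser implements HtmlParser {
    private final Map<String, List<String>> links;

    MapHtmlParser(Map<String, List<String>> links) {
        this.links = links;
    }

    public List<String> getUrls(String url) {
        return links.getOrDefault(url, List.of());
    }

    public static void main(String[] args) {
        HtmlParser parser = new MapHtmlParser(Map.of(
                "http://test.com", List.of("http://test.com/page"),
                "http://test.com/page", List.of("http://test.com")));
        // The visited set breaks the cycle; expected output (order may vary):
        // [http://test.com, http://test.com/page]
        System.out.println(new Solution().crawl("http://test.com", parser));
    }
}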

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • Each hostname label is from 1 to 63 characters long

Visualization

[Diagram: thread-pool crawl of "http://news.yahoo.com/news/topics/" with hostname news.yahoo.com; the steps and key insight below are transcribed from it]

Algorithm steps:

  1. Initialize: create the thread pool, the visited set, and a queue seeded with startUrl.
  2. Extract hostname: parse startUrl to get the allowed hostname.
  3. Parallel crawl: each thread calls getUrls() and filters the returned links by hostname.
  4. Synchronize and add: update the visited set under synchronization and push new URLs onto the queue.

Key Insight: Use a thread pool together with concurrent data structures (a ConcurrentHashMap-backed set for visited URLs, a BlockingQueue for pending URLs). The blocking getUrls() call is what gets parallelized across threads; synchronization is needed only when adding to the visited set. Filter URLs by hostname before queueing them, and use a CountDownLatch or a similar mechanism for completion tracking.
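
A sketch of the queue-based design this insight describes, with one substitution: instead of a CountDownLatch, an AtomicInteger of in-flight pages plus a poison-pill value signals the workers to stop. The worker count and helper names are again assumptions made for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class QueueCrawler {
    private static final String DONE = "";  // poison pill: never a real URL

    public List<String> crawl(String startUrl, HtmlParser htmlParser) throws InterruptedException {
        String hostname = getHostname(startUrl);
        Set<String> visited = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger inFlight = new AtomicInteger(1);  // startUrl is already queued

        visited.add(startUrl);
        queue.put(startUrl);

        int workers = 8;  // arbitrary worker count
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();
                        if (url.equals(DONE)) return;  // shutdown signal
                        for (String next : htmlParser.getUrls(url)) {
                            // enqueue same-host URLs exactly once
                            if (getHostname(next).equals(hostname) && visited.add(next)) {
                                inFlight.incrementAndGet();
                                queue.put(next);
                            }
                        }
                        // last page finished and nothing new was queued: release every worker
                        if (inFlight.decrementAndGet() == 0) {
                            for (int k = 0; k < workers; k++) queue.put(DONE);
                        }
                    }
                } catch (InterruptedException ignored) {
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return new ArrayList<>(visited);
    }

    private String getHostname(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }
}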