Find Duplicate File in System

Problem

Given a list paths of directory info strings, each containing a directory path and all the files in that directory with their contents, return all the duplicate files in the file system in terms of their paths.

You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm".

Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content.

A file path is a string that has the following format: "directory_path/file_name.txt"
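The input format above can be split with a couple of strchr calls. The following is a minimal sketch, not the reference solution: the FileEntry type, the parse_dir_info function, and the buffer sizes PATH_CAP and CONTENT_CAP are all hypothetical names chosen for illustration, and the sketch assumes file contents never contain parentheses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_CAP 256     /* assumed buffer sizes for this sketch */
#define CONTENT_CAP 128

typedef struct {
    char path[PATH_CAP];       /* "directory_path/file_name.txt" */
    char content[CONTENT_CAP]; /* text between '(' and ')' */
} FileEntry;

/* Split one directory info string into at most cap entries.
   Returns the number of entries written to out. */
int parse_dir_info(const char *info, FileEntry *out, int cap) {
    assert(info && out);
    const char *space = strchr(info, ' ');
    if (!space) return 0;                    /* no files listed */
    int dir_len = (int)(space - info);

    int n = 0;
    const char *p = space + 1;
    while (*p && n < cap) {
        const char *open = strchr(p, '(');
        if (!open) break;
        const char *close = strchr(open + 1, ')');
        if (!close) break;

        int name_len = (int)(open - p);
        int content_len = (int)(close - open - 1);
        /* "%.*s" copies at most the given number of characters;
           snprintf truncates instead of overflowing the buffers */
        snprintf(out[n].path, PATH_CAP, "%.*s/%.*s",
                 dir_len, info, name_len, p);
        snprintf(out[n].content, CONTENT_CAP, "%.*s",
                 content_len, open + 1);
        n++;

        p = close + 1;
        while (*p == ' ') p++;               /* skip the separator */
    }
    return n;
}
```

Called on "root/a 1.txt(abcd) 2.txt(efgh)", this should produce two entries: root/a/1.txt with content abcd and root/a/2.txt with content efgh.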

Input & Output

Example 1 — Basic Duplicate Detection

$ Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]

› Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

💡 Note: Files with content 'efgh': root/a/2.txt, root/c/d/4.txt, root/4.txt. Files with content 'abcd': root/a/1.txt, root/c/3.txt

Example 2 — No Duplicates

$ Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(ijkl)"]

› Output: []

💡 Note: All files have unique content, so no duplicates exist

Example 3 — Single Directory

$ Input: paths = ["root 1.txt(same) 2.txt(same) 3.txt(different)"]

› Output: [["root/1.txt","root/2.txt"]]

💡 Note: Only files with 'same' content are duplicates: root/1.txt and root/2.txt

Constraints

1 ≤ paths.length ≤ 2 × 10⁴
1 ≤ sum of all paths[i].length ≤ 5 × 10⁵
paths[i] consists of English letters, digits, '/', '.', '(', ')', and ' '.
You may assume no files or directories share the same name in the same directory.
You may assume each given directory info represents a unique directory.
A single blank space separates the directory path and file info.

The key insight is to use a hash map whose keys are file contents and whose values are lists of file paths, so files with identical content land in the same list. Parse each directory string to extract every file's path and content, insert each pair into the map, then return the lists that hold more than one path. With n as the total number of characters in the input, hash map grouping runs in O(n) time on average and uses O(n) space.

Common Approaches

Brute Force Comparison

⏱️ Time: O(n²) Space: O(n)

Parse all files first, then compare each file's content with every other file's content to identify duplicates. Group files with matching content together.
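As a rough illustration of the pairwise comparison, the sketch below assumes the contents have already been parsed into an array of strings. The function name group_brute_force and the group_id output array are hypothetical; each file is labeled with the index of the first file that has the same content.

```c
#include <assert.h>
#include <string.h>

/* Brute-force grouping: group_id[i] receives the index of the first file
   whose content equals contents[i]. Returns the number of distinct
   contents. Performs O(n^2) string comparisons in the worst case. */
int group_brute_force(const char *const contents[], int n, int group_id[]) {
    assert(contents && group_id);
    int distinct = 0;
    for (int i = 0; i < n; i++) {
        group_id[i] = i;                       /* start as its own group */
        for (int j = 0; j < i; j++) {
            if (strcmp(contents[i], contents[j]) == 0) {
                group_id[i] = group_id[j];     /* join the earlier group */
                break;
            }
        }
        if (group_id[i] == i) distinct++;      /* first file with this content */
    }
    return distinct;
}
```

Files that share a group_id value form one group; a group is a duplicate group only if at least two files carry its id.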

Hash Map Grouping

⏱️ Time: O(n) Space: O(n)

Parse all directory strings to extract files, then use a hash map where content is the key and list of file paths is the value. Files with same content automatically get grouped together.
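C has no built-in hash map, so the grouping step needs a small table of its own. The sketch below is one possible shape, using separate chaining and the djb2 string hash; ContentMap, map_add, map_count, and the BUCKETS size are hypothetical names and values, and the sketch never frees its nodes, which a complete program would need to do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 1024  /* assumed table size for this sketch */

typedef struct PathNode {
    const char *path;
    struct PathNode *next;
} PathNode;

typedef struct Group {
    const char *content;   /* key: file content */
    PathNode *paths;       /* value: paths with that content */
    int count;             /* number of paths in the list */
    struct Group *next;    /* next group in the same bucket */
} Group;

typedef struct { Group *buckets[BUCKETS]; } ContentMap;

/* djb2 string hash */
static unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Find the group for content, creating it if it does not exist yet. */
static Group *map_get(ContentMap *m, const char *content) {
    Group **slot = &m->buckets[hash_str(content) % BUCKETS];
    for (Group *g = *slot; g; g = g->next)
        if (strcmp(g->content, content) == 0) return g;
    Group *g = calloc(1, sizeof *g);
    assert(g);
    g->content = content;
    g->next = *slot;       /* push onto the bucket's chain */
    *slot = g;
    return g;
}

/* Record that the file at path has the given content.
   The map stores the pointers; the caller keeps the strings alive. */
void map_add(ContentMap *m, const char *content, const char *path) {
    Group *g = map_get(m, content);
    PathNode *node = malloc(sizeof *node);
    assert(node);
    node->path = path;
    node->next = g->paths;
    g->paths = node;
    g->count++;
}

/* Number of files sharing this content (0 if never added). */
int map_count(const ContentMap *m, const char *content) {
    const Group *g = m->buckets[hash_str(content) % BUCKETS];
    for (; g; g = g->next)
        if (strcmp(g->content, content) == 0) return g->count;
    return 0;
}
```

After every file has been added, walking the buckets and keeping only groups whose count is at least 2 gives the answer. A zero-initialized ContentMap (for example, ContentMap m = {0};) is a valid empty map.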

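Algorithm Steps

1. Create an empty hash map from file content to a list of file paths.
2. For each directory info string, split off the directory path at the first space.
3. For each file entry name(content), build the path directory/name and append it to the list stored under content.
4. Return every list that contains at least two paths.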

Code

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

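// Fixed capacities for this solution. They are smaller than the problem's
// upper bounds, so inputs beyond these limits are not supported.
#define MAX_FILES 1000
#define MAX_PATH_LEN 500
#define MAX_CONTENT_LEN 100

// All file paths that share one content string
typedef struct {
    char content[MAX_CONTENT_LEN];
    char files[MAX_FILES][MAX_PATH_LEN];
    int count;
} ContentGroup;

// Note: this table reserves roughly 500 MB of static storage
// (MAX_FILES groups x MAX_FILES paths x MAX_PATH_LEN bytes).
static ContentGroup groups[MAX_FILES];
static int groupCount = 0;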

// Parse an array of quoted strings, e.g. ["a b.txt(x)","c d.txt(y)"],
// into result[]. Only text between double quotes is kept, so brackets,
// commas, and the trailing newline left by fgets are ignored.
void parseString(const char* input, char result[][MAX_PATH_LEN], int* count) {
    *count = 0;
    const char* p = input;

    while (*count < MAX_FILES) {
        // Find the opening quote of the next string
        const char* open = strchr(p, '"');
        if (!open) break;

        // Find the matching closing quote
        const char* close = strchr(open + 1, '"');
        if (!close) break;

        // Copy the string, truncating if it would overflow the buffer
        size_t length = (size_t)(close - open - 1);
        if (length >= MAX_PATH_LEN) length = MAX_PATH_LEN - 1;
        memcpy(result[*count], open + 1, length);
        result[*count][length] = '\0';
        (*count)++;

        // Continue after the closing quote
        p = close + 1;
    }
}

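// Return the index of the group holding this content, or -1 if none exists.
// This is a linear scan, so each lookup costs O(groupCount) rather than the
// O(1) average of a real hash table.
int findContentGroup(const char* content) {
    for (int i = 0; i < groupCount; i++) {
        if (strcmp(groups[i].content, content) == 0) {
            return i;
        }
    }
    return -1;
}

void addToGroup(const char* content, const char* filepath) {
    int groupIdx = findContentGroup(content);
    if (groupIdx == -1) {
        // Create new group (skip the file if the table is full)
        if (groupCount >= MAX_FILES) return;
        groupIdx = groupCount++;
        snprintf(groups[groupIdx].content, MAX_CONTENT_LEN, "%s", content);
        groups[groupIdx].count = 0;
    }
    // Append the path to the group (skip it if the group is full)
    if (groups[groupIdx].count >= MAX_FILES) return;
    snprintf(groups[groupIdx].files[groups[groupIdx].count], MAX_PATH_LEN, "%s", filepath);
    groups[groupIdx].count++;
}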

void solution(char paths[][MAX_PATH_LEN], int pathCount) {
    groupCount = 0;
    
    for (int p = 0; p < pathCount; p++) {
        char* pathInfo = paths[p];
        
        // Find first space to separate directory from files
        char* firstSpace = strchr(pathInfo, ' ');
        if (!firstSpace) continue;
        
        // Extract directory
        char directory[MAX_PATH_LEN];
        int dirLen = firstSpace - pathInfo;
        strncpy(directory, pathInfo, dirLen);
        directory[dirLen] = '\0';
        
        // Process files
        char* current = firstSpace + 1;
        while (*current) {
            // Skip spaces
            while (*current == ' ') current++;
            if (!*current) break;
            
            // Find the file info end (next space or end of string)
            char* nextSpace = strchr(current, ' ');
            char fileInfo[MAX_PATH_LEN];
            if (nextSpace) {
                int len = nextSpace - current;
                strncpy(fileInfo, current, len);
                fileInfo[len] = '\0';
                current = nextSpace + 1;
            } else {
                strcpy(fileInfo, current);
                current += strlen(current);
            }
            
            // Parse filename and content
            char* parenStart = strchr(fileInfo, '(');
            if (!parenStart) continue;
            
            char filename[MAX_PATH_LEN];
            char content[MAX_CONTENT_LEN];
            
            // Extract filename
            int filenameLen = parenStart - fileInfo;
            strncpy(filename, fileInfo, filenameLen);
            filename[filenameLen] = '\0';
            
            // Extract content (skip opening paren, stop at closing paren)
            char* parenEnd = strchr(parenStart + 1, ')');
            if (!parenEnd) continue;
            
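            int contentLen = parenEnd - parenStart - 1;
            // Content longer than the buffer cannot be compared reliably; skip it
            if (contentLen >= MAX_CONTENT_LEN) continue;
            strncpy(content, parenStart + 1, contentLen);
            content[contentLen] = '\0';
            
            // Create full path (snprintf truncates instead of overflowing)
            char fullPath[MAX_PATH_LEN];
            snprintf(fullPath, sizeof(fullPath), "%s/%s", directory, filename);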
            
            // Add to content group
            addToGroup(content, fullPath);
        }
    }
}

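int main(void) {
    // Static buffers keep the large arrays off the stack
    static char input[MAX_FILES * MAX_PATH_LEN];
    if (!fgets(input, sizeof(input), stdin)) return 1;
    
    static char paths[MAX_FILES][MAX_PATH_LEN];
    int pathCount;
    parseString(input, paths, &pathCount);
    
    solution(paths, pathCount);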
    
    // Print result
    printf("[");
    int first = 1;
    for (int i = 0; i < groupCount; i++) {
        if (groups[i].count >= 2) {
            if (!first) printf(",");
            first = 0;
            
            printf("[");
            for (int j = 0; j < groups[i].count; j++) {
                if (j > 0) printf(",");
                printf("\"%s\"", groups[i].files[j]);
            }
            printf("]");
        }
    }
    printf("]\n");
    
    return 0;
}

Time & Space Complexity

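Time Complexity

O(n) on average with a hash map, where n is the total number of characters across all directory strings: each character is parsed once and each content string is hashed once. The C listing above looks groups up with a linear scan instead, which costs O(g) per file for g distinct contents.

Space Complexity

O(n): every file path and content string is stored once.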
