Find Duplicate File in System - Problem

Imagine you're a system administrator managing a large file system with thousands of files scattered across multiple directories. Over time, users have created duplicate files with identical content but stored in different locations, wasting precious storage space.

Your mission: Given a list of directory information strings, identify all groups of duplicate files that have the same content.

Each directory info string follows this format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

This means:

  • Directory path: root/d1/d2/.../dm
  • Files with their content in parentheses: f1.txt(f1_content), etc.
  • If m = 0, the directory is just the root

Goal: Return all groups of duplicate files, where each group contains at least 2 files with identical content. Each file should be represented by its full path: "directory_path/file_name.txt"

Example: If two files both have content "abcd", group them together regardless of their location in the file system.
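The format above can be illustrated with a minimal parsing sketch. The function name `parse_directory_info` is hypothetical, and the code assumes that file contents contain no spaces or parentheses, which the problem statement does not explicitly guarantee.

```python
from typing import List, Tuple


def parse_directory_info(info: str) -> List[Tuple[str, str]]:
    """Return (full_path, content) pairs for one directory info string.

    Assumption: contents contain no spaces or parentheses.
    """
    # The first token is the directory path; the rest are file entries.
    directory, *entries = info.split(" ")
    files = []
    for entry in entries:
        # Each entry looks like "name.txt(content)"; split at the first "(".
        name, _, rest = entry.partition("(")
        content = rest[:-1]  # drop the trailing ")"
        files.append((f"{directory}/{name}", content))
    return files


print(parse_directory_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```

When m = 0, the directory token is just `root`, so the same code yields paths such as `root/4.txt`.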

Input & Output

example_1.py - Basic case with duplicates
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Note: Files with content 'efgh' (root/a/2.txt, root/c/d/4.txt, root/4.txt) form one group. Files with content 'abcd' (root/a/1.txt, root/c/3.txt) form another group.

example_2.py - No duplicates case
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(hijk)"]
Output: []
Note: All files have unique content, so no duplicate groups are found.

example_3.py - Multiple files same directory
Input: paths = ["root 1.txt(same) 2.txt(same) 3.txt(different)"]
Output: [["root/1.txt","root/2.txt"]]
Note: Two files in the root directory have identical content 'same', forming one duplicate group.

Visualization

[Figure: File System Duplicate Detection. Panels: File System Structure, Hash Map Processing, Result Groups, Algorithm Workflow. In the example shown, root/a.txt and sub/c.txt share content "abc", and root/b.txt and sub/d.txt share content "def", giving two duplicate groups.]

Understanding the Visualization

  1. Parse Directory Structure: Extract file paths and contents from the directory strings, like cataloging books and their content.
  2. Hash Map Grouping: Use content as the key to group files, like organizing books by their actual text.
  3. Identify Duplicates: Files with the same content are duplicates, like books with identical text but different covers.
  4. Build Result Groups: Return only the groups with 2 or more files, the sets of duplicate files to consider for cleanup.
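
The four steps above can be sketched as follows. This is one possible implementation rather than a reference solution: the function name `find_duplicate` is an assumption, and the parsing assumes contents contain no spaces or parentheses.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate(paths: List[str]) -> List[List[str]]:
    # Step 2's hash map: content -> list of full file paths.
    groups: Dict[str, List[str]] = defaultdict(list)

    for info in paths:
        # Step 1: the first token is the directory, the rest are files.
        directory, *entries = info.split(" ")
        for entry in entries:
            name, _, rest = entry.partition("(")
            content = rest[:-1]  # drop the trailing ")"
            groups[content].append(f"{directory}/{name}")

    # Steps 3 and 4: keep only contents shared by two or more files.
    return [files for files in groups.values() if len(files) >= 2]


result = find_duplicate(
    ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
     "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
)
for group in result:
    print(group)
```

The groups may come out in a different order than in example_1, which is acceptable because the answer can be returned in any order.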
Key Takeaway
Key Insight: Content is the identifier. A hash map keyed by content groups each file with an average O(1) lookup once its content has been hashed, so duplicate detection stays efficient even for file systems with thousands of files.

Time & Space Complexity

Time Complexity
O(n * m)

n is the total number of files and m is the average content length. Each file is parsed once, and hashing its content takes time proportional to its length, so the running time grows linearly with the total input size.

Space Complexity
O(n * m)

The hash map stores every file path and content string, plus the result groups, so space also grows linearly with the total input size.

Constraints

  • 1 ≤ paths.length ≤ 2 × 10^4
  • 1 ≤ paths[i].length ≤ 3000
  • 1 ≤ sum of all file contents length ≤ 5 × 10^5
  • paths[i] has the format "dir file1.txt(content1) file2.txt(content2) ... fileN.txt(contentN)"
  • Answer can be returned in any order
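
Because the answer can be returned in any order, comparing a result against an expected output needs an order-insensitive check. The helper below is a small sketch. The name `normalize` is hypothetical, and it assumes that the order of paths inside each group is also irrelevant.

```python
from typing import FrozenSet, List, Set


def normalize(groups: List[List[str]]) -> Set[FrozenSet[str]]:
    """Turn a list of groups into a set of frozensets so order is ignored."""
    return {frozenset(group) for group in groups}


expected = [["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
            ["root/a/1.txt", "root/c/3.txt"]]
candidate = [["root/c/3.txt", "root/a/1.txt"],
             ["root/4.txt", "root/a/2.txt", "root/c/d/4.txt"]]

print(normalize(expected) == normalize(candidate))  # True
```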