How to Find Duplicate Files in Unix?
As we increasingly depend on digital media for storing our important files, we tend to accumulate a large number of files over time. It can be challenging to manage these files, particularly when we have multiple copies of the same file that can consume storage space. Unix provides several powerful command-line methods to find and remove duplicate files, saving both time and disk space.
In this article, we will explore various approaches to find duplicate files in Unix and demonstrate the terminal commands that can be used for each method. These approaches allow you to choose the method that best suits your needs depending on the type and amount of data you need to manage.
Find Duplicate Files Using fdupes
fdupes is a specialized terminal tool that allows you to find duplicate files recursively in a directory tree. The tool is available on most Unix-like systems and uses file content comparison for accurate duplicate detection.
Open a terminal window and navigate to the directory you want to scan for duplicate files −
cd Desktop/duplicate
Then, run fdupes to find duplicate files −
fdupes -r .
The -r option tells fdupes to scan the current directory and its subdirectories recursively. The . specifies the current directory as the starting point. After running the command, fdupes will examine all files in the directory tree and return a list of duplicate files grouped together.
./folder2/hello.txt
./folder1/hello.txt
./folder3/hello.txt
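The output above assumes a directory tree containing identical copies of the same file. If you want to experiment with fdupes, you can build a minimal fixture like this (the directory name `duplicate-demo` is just an illustration, not part of fdupes):

```shell
# Build a small tree of identical files to experiment with
# (the name "duplicate-demo" is an example, not required by fdupes)
mkdir -p duplicate-demo/folder1 duplicate-demo/folder2 duplicate-demo/folder3
for d in folder1 folder2 folder3; do
   printf 'duplicate file contents\n' > "duplicate-demo/$d/hello.txt"
done
```

Running `fdupes -r duplicate-demo` on this tree should then group the three `hello.txt` copies together, matching the kind of output shown above.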
Additional fdupes Options
fdupes -r -d .   # Delete duplicates interactively
fdupes -r -S .   # Show size of duplicate files
fdupes -r -m .   # Summarize duplicate file information
Find Duplicate Files Using jdupes
jdupes is an enhanced fork of fdupes with improved performance and additional features. It matches files by size and then by full content, so it can identify duplicates even if they have different names or are located in different directories.
To use jdupes, run this command in the terminal −
jdupes -r .
The -r option instructs jdupes to scan the current directory and its subdirectories recursively. When executed, it will scan all files and display duplicate files with additional scanning information.
Scanning: 9 files, 4 items (in 1 specified)
./folder1/hello.txt
./folder2/hello.txt
./folder3/hello.txt
Find Duplicate Files Using Awk Tool
The awk utility provides a flexible approach to find duplicate files by filename. This method scans directory structures and identifies files with identical names regardless of their location.
awk -F'/' '{
   f = $NF
   arr[f] = f in arr ? arr[f] RS $0 : $0
   bb[f]++
}
END {
   for (x in bb)
      if (bb[x] > 1)
         printf "Name of duplicate files: %s\n%s\n", x, arr[x]
}' <(find . -type f)
This script reads each file path emitted by find (fed in here via bash process substitution), splits it on forward slashes, and extracts the filename with $NF. Each path is appended to the arr entry for that filename, while the bb array counts how many times each filename occurs; at the end, any name seen more than once is printed along with all of its paths.
Name of duplicate files: unique.txt
./folder2/unique.txt
./folder1/unique.txt
./folder3/unique.txt
Name of duplicate files: hello.txt
./folder2/hello.txt
./folder1/hello.txt
./folder3/hello.txt
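A self-contained run of the same awk logic against a small scratch tree (the paths below are illustrative) also shows the limitation of name-based matching: files with the same name but different content are still flagged.

```shell
# Scratch tree: two same-named files (different content) and one unique name
mkdir -p awk-demo/folder1 awk-demo/folder2
echo 'hello' > awk-demo/folder1/hello.txt
echo 'world' > awk-demo/folder2/hello.txt   # same name, different content
echo 'only'  > awk-demo/folder1/notes.txt

# Same logic as above, piping find's output instead of using <( )
find awk-demo -type f | awk -F'/' '{
   f = $NF
   arr[f] = f in arr ? arr[f] RS $0 : $0
   bb[f]++
}
END {
   for (x in bb)
      if (bb[x] > 1)
         printf "Name of duplicate files: %s\n%s\n", x, arr[x]
}'
```

Here both `hello.txt` paths are reported as duplicates even though their contents differ, which is why name-based matching is best treated as a quick screen rather than a definitive check.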
Find Duplicate Files by Size Using Awk
You can also use awk to find files with identical sizes, as duplicate files typically have the same size. This method is faster but less precise than content-based comparison.
awk '{
   fsize = $1
   fpath[fsize] = fsize in fpath ? fpath[fsize] RS $2 : $2
   count[fsize]++
}
END {
   for (size in count)
      if (count[size] > 1)
         printf "Duplicate files by size: %d bytes\n%s\n", size, fpath[size]
}' <(find . -type f -exec du -b {} +)
Here, du -b supplies one "size path" line per file, so $1 holds the size in bytes and $2 the path. If a size already exists in the fpath array, the current path is appended to it; the count array tracks how many times each size appears, and the END block prints every size shared by more than one file along with the matching paths.
Duplicate files by size: 13 bytes
./folder2/unique.txt
./folder3/unique.txt
Duplicate files by size: 20 bytes
./folder2/hello.txt
./folder1/hello.txt
./folder3/hello.txt
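If fdupes or jdupes is not installed, a content-based check can also be approximated with standard checksum tools. This sketch assumes GNU coreutils (`md5sum` and `uniq` with `-w` and `--all-repeated`), and the `md5-demo` tree is an illustrative fixture:

```shell
# Scratch tree: two files with identical content, one different
mkdir -p md5-demo/a md5-demo/b
printf 'same content\n' > md5-demo/a/one.txt
printf 'same content\n' > md5-demo/b/two.txt
printf 'different\n'    > md5-demo/a/three.txt

# Hash every file, sort by hash, and print only groups that share a hash.
# -w32 compares just the 32-character MD5 prefix of each line;
# --all-repeated=separate prints every member of each duplicate group.
find md5-demo -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```

Unlike the name- and size-based awk methods, this groups files by actual content, at the cost of hashing every file in the tree.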
Comparison of Methods
| Method | Accuracy | Speed | Best For |
|---|---|---|---|
| fdupes | High (content-based) | Medium | Precise duplicate detection |
| jdupes | High (content-based) | Fast | Large file collections |
| awk (by name) | Medium | Fast | Same-named files |
| awk (by size) | Low | Very Fast | Quick size-based screening |
Conclusion
Unix provides several efficient methods to find duplicate files, from specialized tools like fdupes and jdupes to flexible scripting with awk. Content-based tools offer the highest accuracy, while size-based methods provide quick screening capabilities. Choose the method that best fits your accuracy requirements and performance needs.
