How to Find Duplicate Files in Unix?
As we increasingly depend on digital media for storing our important files, we tend to accumulate a large number of files over time. It can be challenging to manage these files, particularly when we have multiple copies of the same file that can consume storage space. Unix provides several powerful command-line methods to find and remove duplicate files, saving both time and disk space.
In this article, we will explore various approaches to find duplicate files in Unix and demonstrate the terminal commands that can be used for each method. These approaches allow you to choose the method that best suits your needs depending on the type and amount of data you need to manage.
Find Duplicate Files Using fdupes
fdupes is a specialized terminal tool that allows you to find duplicate files recursively in a directory tree. The tool is available on most Unix-like systems and uses file content comparison for accurate duplicate detection.
Open a terminal window and navigate to the directory you want to scan for duplicate files −
cd Desktop/duplicate
Then, run fdupes to find duplicate files −
fdupes -r .
The -r option tells fdupes to scan the current directory and its subdirectories recursively. The . specifies the current directory as the starting point. After running the command, fdupes will examine all files in the directory tree and return a list of duplicate files grouped together.
./folder2/hello.txt
./folder1/hello.txt
./folder3/hello.txt
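The output above assumes a directory tree containing identical copies of the same file. If you want to experiment with fdupes, you can build a minimal fixture like this (the directory name `duplicate-demo` is just an illustration, not part of fdupes):

```shell
# Build a small tree of identical files to experiment with
# (the name "duplicate-demo" is an example, not required by fdupes)
mkdir -p duplicate-demo/folder1 duplicate-demo/folder2 duplicate-demo/folder3
for d in folder1 folder2 folder3; do
   printf 'duplicate file contents\n' > "duplicate-demo/$d/hello.txt"
done
```

Running `fdupes -r duplicate-demo` on this tree should then group the three `hello.txt` copies together, matching the kind of output shown above.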
Additional fdupes Options
fdupes -r -d .   # Delete duplicates interactively
fdupes -r -S .   # Show size of duplicate files
fdupes -r -m .   # Summarize duplicate file information
Find Duplicate Files Using jdupes
jdupes is an enhanced fork of fdupes with improved performance and additional features. It matches files by size and then by full content, so it can identify duplicates even if they have different names or are located in different directories.
To use jdupes, run this command in the terminal −
jdupes -r .
The -r option instructs jdupes to scan the current directory and its subdirectories recursively. When executed, it will scan all files and display duplicate files with additional scanning information.
Scanning: 9 files, 4 items (in 1 specified)
./folder1/hello.txt
./folder2/hello.txt
./folder3/hello.txt
Find Duplicate Files Using Awk Tool
The awk utility provides a flexible approach to find duplicate files by filename. This method scans directory structures and identifies files with identical names regardless of their location.
awk -F'/' '{
   f = $NF
   arr[f] = f in arr ? arr[f] RS $0 : $0
   bb[f]++
}
END {
   for (x in bb)
      if (bb[x] > 1)
         printf "Name of duplicate files: %s\n%s\n", x, arr[x]
}' <(find . -type f)
This script reads each file path emitted by find (fed in here via bash process substitution), splits it on forward slashes, and extracts the filename with $NF. Each path is appended to the arr entry for that filename, while the bb array counts how many times each filename occurs; at the end, any name seen more than once is printed along with all of its paths.
Name of duplicate files: unique.txt
./folder2/unique.txt
./folder1/unique.txt
./folder3/unique.txt
Name of duplicate files: hello.txt
./folder2/hello.txt
./folder1/hello.txt
./folder3/hello.txt
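A self-contained run of the same awk logic against a small scratch tree (the paths below are illustrative) also shows the limitation of name-based matching: files with the same name but different content are still flagged.

```shell
# Scratch tree: two same-named files (different content) and one unique name
mkdir -p awk-demo/folder1 awk-demo/folder2
echo 'hello' > awk-demo/folder1/hello.txt
echo 'world' > awk-demo/folder2/hello.txt   # same name, different content
echo 'only'  > awk-demo/folder1/notes.txt

# Same logic as above, piping find's output instead of using <( )
find awk-demo -type f | awk -F'/' '{
   f = $NF
   arr[f] = f in arr ? arr[f] RS $0 : $0
   bb[f]++
}
END {
   for (x in bb)
      if (bb[x] > 1)
         printf "Name of duplicate files: %s\n%s\n", x, arr[x]
}'
```

Here both `hello.txt` paths are reported as duplicates even though their contents differ, which is why name-based matching is best treated as a quick screen rather than a definitive check.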
Find Duplicate Files by Size Using Awk
You can also use awk to find files with identical sizes, as duplicate files typically have the same size. This method is faster but less precise than content-based comparison.
awk '{
   fsize = $1
   fpath[fsize] = fsize in fpath ? fpath[fsize] RS $2 : $2
   count[fsize]++
}
END {
   for (size in count)
      if (count[size] > 1)
         printf "Duplicate files by size: %d bytes\n%s\n", size, fpath[size]
}' <(find . -type f -exec du -b {} +)
Here, du -b supplies one "size path" line per file, so $1 holds the size in bytes and $2 the path. If a size already exists in the fpath array, the current path is appended to it; the count array tracks how many times each size appears, and the END block prints every size shared by more than one file along with the matching paths.
Duplicate files by size: 13 bytes
./folder2/unique.txt
./folder3/unique.txt
Duplicate files by size: 20 bytes
./folder2/hello.txt
./folder1/hello.txt
./folder3/hello.txt
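If fdupes or jdupes is not installed, a content-based check can also be approximated with standard checksum tools. This sketch assumes GNU coreutils (`md5sum` and `uniq` with `-w` and `--all-repeated`), and the `md5-demo` tree is an illustrative fixture:

```shell
# Scratch tree: two files with identical content, one different
mkdir -p md5-demo/a md5-demo/b
printf 'same content\n' > md5-demo/a/one.txt
printf 'same content\n' > md5-demo/b/two.txt
printf 'different\n'    > md5-demo/a/three.txt

# Hash every file, sort by hash, and print only groups that share a hash.
# -w32 compares just the 32-character MD5 prefix of each line;
# --all-repeated=separate prints every member of each duplicate group.
find md5-demo -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
```

Unlike the name- and size-based awk methods, this groups files by actual content, at the cost of hashing every file in the tree.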
Comparison of Methods
| Method | Accuracy | Speed | Best For |
|---|---|---|---|
| fdupes | High (content-based) | Medium | Precise duplicate detection |
| jdupes | High (content-based) | Fast | Large file collections |
| awk (by name) | Medium | Fast | Same-named files |
| awk (by size) | Low | Very Fast | Quick size-based screening |
Conclusion
Unix provides several efficient methods to find duplicate files, from specialized tools like fdupes and jdupes to flexible scripting with awk. Content-based tools offer the highest accuracy, while size-based methods provide quick screening capabilities. Choose the method that best fits your accuracy requirements and performance needs.
