Finding Duplicate Files with Python


Duplicate files, which consume extra space on the hard drive as we accumulate data on our PC, are a common occurrence, and finding and removing them manually can be time-consuming and tedious. Thankfully, we can automate this process with Python, so in this lesson we will take a look at exactly how to do that with a short script.

Setup

For this tutorial, we'll use the built-in os and hashlib modules in Python, so no additional packages need to be installed.

import os
import hashlib

Algorithm

  • Write a function called hashfile() that accepts a filename as input and uses the SHA-1 algorithm to produce and return a hash value for that file.

  • Define the function find_duplicates(), which takes a directory name as its parameter and searches that directory for duplicate files, reporting the paths of every duplicate it finds.

  • Inside find_duplicates(), compute a hash value for each file in the directory using the hashfile() function and store these hash values in a dictionary.

  • If a hash value is already present in the dictionary, the file path is reported as a duplicate. If not, the hash value and file path are added to the dictionary.

  • Entries that end up holding only a single file path are not duplicates, so filter them away, as sketched below.
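
A minimal sketch of this outline, assuming a directory path is passed in (the function name sketch_find_duplicates() and the group-then-filter structure are illustrative rather than part of the final script), could look like this −

import os
import hashlib

def sketch_find_duplicates(dirname):
   groups = {}   # hash value -> list of file paths sharing that hash
   for name in os.listdir(dirname):
      path = os.path.join(dirname, name)
      if not os.path.isfile(path):
         continue
      with open(path, 'rb') as f:
         digest = hashlib.sha1(f.read()).hexdigest()
      groups.setdefault(digest, []).append(path)
   # Filter away entries that do not actually contain duplicate file paths
   return [paths for paths in groups.values() if len(paths) > 1]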

Example

import os
import hashlib

def hashfile(filename):
   # Open the file in binary mode and hash its contents with SHA-1
   with open(filename, 'rb') as f:
      hasher = hashlib.sha1()
      # Read in 1024-byte chunks so large files are never loaded entirely into memory
      while True:
         data = f.read(1024)
         if not data:
            break
         hasher.update(data)
      return hasher.hexdigest()

This code defines the hashfile() function with the help of Python's hashlib module. It takes a filename as its input and returns a SHA-1 hash value for that file. Reading 1024 bytes at a time, the function updates the hash with each chunk and, after the whole file has been read, returns the hash value's hexdigest as a string.
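
For example, assuming a file named sample.txt exists in the current directory (a hypothetical filename used purely for illustration, not part of the original script), the function can be called like this −

# 'sample.txt' is a placeholder filename used only for this illustration
print(hashfile('sample.txt'))   # prints a 40-character hexadecimal SHA-1 digest

Two files with identical contents always produce the same digest, which is exactly what find_duplicates() relies on below.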

def find_duplicates(dirname):
   files = os.listdir(dirname)
   # Nothing to compare if the directory holds fewer than two entries
   if len(files) < 2:
      return
   hashes = {}   # maps each hash value to the first file path seen with it
   for filename in files:
      path = os.path.join(dirname, filename)
      # Skip subdirectories and anything else that is not a regular file
      if not os.path.isfile(path):
         continue
      file_hash = hashfile(path)
      if file_hash not in hashes:
         hashes[file_hash] = path
      else:
         print(f'Duplicate found: {path} and {hashes[file_hash]}')

The find_duplicates() function in this code accepts a directory name as input and reports the file paths of any duplicate files it finds. It first lists the directory's contents with the os.listdir() method and returns early if there are fewer than two entries to compare. It then creates an empty dictionary, hashes, to map each hash value to the path of the first file seen with that hash. The code iterates over each entry in the directory and uses the os.path.isfile() function to skip anything that is not a regular file.

For every file, the function computes a hash value using the previously defined hashfile() function and checks whether that hash value is already present in the hashes dictionary. If it is not, the hash value and file path are added to the dictionary; if it is, the current file is a duplicate of an earlier one, and the function prints both paths.
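
If you would rather collect the duplicates than print them as they are found, a small variation of the same function (shown here as a sketch; the name find_duplicate_pairs() is not part of the original script) can return them as a list of path pairs −

def find_duplicate_pairs(dirname):
   # Same logic as find_duplicates(), but duplicate pairs are returned instead of printed
   hashes = {}
   pairs = []
   for filename in os.listdir(dirname):
      path = os.path.join(dirname, filename)
      if not os.path.isfile(path):
         continue
      file_hash = hashfile(path)
      if file_hash in hashes:
         pairs.append((hashes[file_hash], path))
      else:
         hashes[file_hash] = path
   return pairs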

Finally, call the function to run the entire script −

if __name__ == '__main__':
   find_duplicates('.')

Explanation

So the whole code can be written as −

import os
import hashlib
def hashfile(filename):
   with open(filename, 'rb') as f:
      hasher = hashlib.sha1()
      while True:
         data = f.read(1024)
         if not data:
            break
         hasher.update(data)
      return hasher.hexdigest()
 
def find_duplicates(dirname):
   files = os.listdir(dirname)
   if len(files) < 2:
      return
   hashes = {}
   for filename in files:
      path = os.path.join(dirname, filename)
      if not os.path.isfile(path):
         continue
      file_hash = hashfile(path)
      if file_hash not in hashes:
         hashes[file_hash] = path
      else:
         print(f'Duplicate found: {path} and {hashes[file_hash]}')
            
if __name__ == '__main__':
   find_duplicates('.')

Output

[You can copy and paste several files into the directory that you run the script in to get the desired output]
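
For instance, a duplicate for testing can be created by copying an existing file before running the script (the filenames below simply mirror the sample output and are only an assumption about your directory's contents) −

import shutil

# Create an exact copy of an existing file so the script has a duplicate to report
shutil.copy('DSA-guide-2023.pdf', 'DSA-guide-2023 - Copy.pdf')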

Duplicate found: .\DSA-guide-2023.pdf and .\DSA-guide-2023 - Copy.pdf

  • We start by importing the os module, which provides a way of interacting with the operating system, and the hashlib module, which gives us access to hashing algorithms such as SHA-1.

  • The hashfile() function computes and returns the SHA-1 hash value of a single file, reading it in 1024-byte chunks.

  • The find_duplicates() function is then defined; it takes a directory name as its input and reports every duplicate file found in that directory.

  • The hashes variable is the dictionary used to store the hash value of each file, mapped to the path of the first file seen with that hash.

  • We iterate over the directory's entries with os.listdir(), skipping anything that is not a regular file.

  • For each file in the directory, we use the hashfile() function to compute its hash value.

  • We then check whether that hash value already exists in the hashes dictionary. If it does, the path of the current file and the path of the earlier file with the same hash are printed as a duplicate pair. If it does not, the hash value and file path are added to the hashes dictionary.

  • Note that os.listdir() only looks at the top level of the given directory; a variant that also checks subdirectories is sketched below.
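
To include subdirectories, os.walk() can be used to traverse the whole directory tree. The following is a sketch of such an extension (the name find_duplicates_recursive() is not part of the original script) −

def find_duplicates_recursive(dirname):
   # Walk the directory tree so files in subdirectories are checked too
   hashes = {}
   for root, _dirs, filenames in os.walk(dirname):
      for filename in filenames:
         path = os.path.join(root, filename)
         file_hash = hashfile(path)
         if file_hash in hashes:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
         else:
            hashes[file_hash] = path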

Conclusion

Even though it may take some effort, identifying duplicate files is essential for keeping our systems organized and tidy. In this post, we've shown how to use Python's os and hashlib modules to find duplicate files in a directory. The os module simplifies communication with the operating system, while the hashlib module gives us access to a file's hash value. By combining these two modules, we were able to build the find_duplicates() function, which can locate duplicate files in a directory.
