Finding Duplicate Files with Python

Duplicate files consume unnecessary storage space on our computers. Finding and removing them manually can be time-consuming, but we can automate this process with Python using file hashing techniques.

Required Modules

We'll use Python's built-in os and hashlib modules, so no additional packages need to be installed:

import os
import hashlib

How It Works

The algorithm works by:

  • Creating a hashfile() function that generates a SHA-1 hash of each file's contents

  • Storing hash values in a dictionary as we scan through files

  • When the same hash appears a second time, we've found a duplicate file

  • Files with identical hashes have identical content (SHA-1 collisions are vanishingly unlikely in practice)
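The dictionary-based detection described above can be sketched in isolation before we touch the filesystem. This toy example hashes in-memory byte strings instead of files; the names a.txt, b.txt, and c.txt are made up for illustration:

```python
import hashlib

# Toy illustration of the dict-based approach: hash each item,
# remember the first "path" seen for each digest, and report repeats.
items = {
    'a.txt': b'same bytes',
    'b.txt': b'same bytes',
    'c.txt': b'other bytes',
}

seen = {}          # digest -> first path that produced it
duplicates = []    # (duplicate, original) pairs

for path, content in items.items():
    digest = hashlib.sha1(content).hexdigest()
    if digest in seen:
        duplicates.append((path, seen[digest]))
    else:
        seen[digest] = path

print(duplicates)  # [('b.txt', 'a.txt')]
```

Replacing the byte strings with file contents gives exactly the logic used in the full script below.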

Creating the Hash Function

First, let's create a function to generate a hash value for any file:

import os
import hashlib

def hashfile(filename):
    with open(filename, 'rb') as f:
        hasher = hashlib.sha1()
        while True:
            data = f.read(1024)
            if not data:
                break
            hasher.update(data)
        return hasher.hexdigest()

# Test with a sample file (create one first)
with open('test.txt', 'w') as f:
    f.write('Hello World')

print(hashfile('test.txt'))
0a4d55a8d778e5022fab701977c5d840bbc486d0

This function reads the file in 1024-byte chunks and updates the SHA-1 hash for each chunk, returning the final hash as a hexadecimal string.

Finding Duplicates

Now let's create the main function to find duplicate files in a directory:

import os
import hashlib

def hashfile(filename):
    with open(filename, 'rb') as f:
        hasher = hashlib.sha1()
        while True:
            data = f.read(1024)
            if not data:
                break
            hasher.update(data)
        return hasher.hexdigest()

def find_duplicates(dirname):
    files = os.listdir(dirname)
    if len(files) < 2:
        print("Not enough files to check for duplicates")
        return
    
    hashes = {}
    duplicates_found = False
    
    for filename in files:
        path = os.path.join(dirname, filename)
        if not os.path.isfile(path):
            continue
        
        file_hash = hashfile(path)
        
        if file_hash not in hashes:
            hashes[file_hash] = path
        else:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
            duplicates_found = True
    
    if not duplicates_found:
        print("No duplicates found")

# Create some test files
with open('file1.txt', 'w') as f:
    f.write('Same content')
    
with open('file2.txt', 'w') as f:
    f.write('Same content')
    
with open('file3.txt', 'w') as f:
    f.write('Different content')

# Find duplicates in current directory
find_duplicates('.')
Duplicate found: ./file2.txt and ./file1.txt

Complete Script

Here's the complete duplicate file finder with error handling:

import os
import hashlib

def hashfile(filename):
    """Generate SHA-1 hash for a file"""
    try:
        with open(filename, 'rb') as f:
            hasher = hashlib.sha1()
            while True:
                data = f.read(1024)
                if not data:
                    break
                hasher.update(data)
            return hasher.hexdigest()
    except IOError:
        print(f"Error reading file: {filename}")
        return None

def find_duplicates(dirname):
    """Find duplicate files in a directory"""
    try:
        files = os.listdir(dirname)
    except OSError:
        print(f"Error accessing directory: {dirname}")
        return
    
    if len(files) < 2:
        print("Not enough files to check for duplicates")
        return
    
    hashes = {}
    duplicates_found = False
    
    for filename in files:
        path = os.path.join(dirname, filename)
        
        # Skip directories and non-readable files
        if not os.path.isfile(path):
            continue
        
        file_hash = hashfile(path)
        if file_hash is None:
            continue
        
        if file_hash not in hashes:
            hashes[file_hash] = path
        else:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
            duplicates_found = True
    
    if not duplicates_found:
        print("No duplicates found")

if __name__ == '__main__':
    directory = input("Enter directory path (or '.' for current): ").strip()
    if not directory:
        directory = '.'
    find_duplicates(directory)

Key Features

  • SHA-1 Hashing: Creates effectively unique fingerprints for files

  • Memory Efficient: Reads files in small chunks

  • Error Handling: Handles file access errors gracefully

  • Directory Filtering: Only processes actual files, not directories

Conclusion

This Python script efficiently identifies duplicate files by comparing SHA-1 hash values. The approach is memory-efficient and can handle large files by processing them in chunks rather than loading entire files into memory.

Updated on: 2026-03-27T13:05:09+05:30
