Finding Duplicate Files with Python
Duplicate files consume unnecessary storage space on our computers. Finding and removing them manually can be time-consuming, but we can automate this process with Python using file hashing techniques.
Required Modules
We'll use Python's built-in os and hashlib modules, so no additional packages need to be installed:

import os
import hashlib
How It Works
The algorithm works by:
Creating a hashfile() function that generates a unique SHA-1 hash for each file
Storing hash values in a dictionary as we scan through files
When the same hash appears twice, we've found duplicate files
Files with identical hashes have identical content
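The steps above can be sketched in a few lines before we deal with real files. This toy version hashes in-memory byte strings instead of file contents; the names and contents here are purely illustrative:

```python
import hashlib

# Illustrative stand-ins for file contents (not real files on disk)
fake_files = {
    'a.txt': b'hello',
    'b.txt': b'world',
    'c.txt': b'hello',   # same content as a.txt
}

seen = {}  # maps hash -> first filename seen with that hash
for name, content in fake_files.items():
    digest = hashlib.sha1(content).hexdigest()
    if digest in seen:
        # Second time we see this hash: the contents are identical
        print(f'{name} duplicates {seen[digest]}')
    else:
        seen[digest] = name
# prints: c.txt duplicates a.txt
```

The real script below follows the same pattern, with file paths as dictionary values and file contents fed to the hash in chunks.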
Creating the Hash Function
First, let's create a function to generate a hash value for any file:
import os
import hashlib

def hashfile(filename):
    with open(filename, 'rb') as f:
        hasher = hashlib.sha1()
        while True:
            data = f.read(1024)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

# Test with a sample file (create one first)
with open('test.txt', 'w') as f:
    f.write('Hello World')

print(hashfile('test.txt'))
0a4d55a8d778e5022fab701977c5d840bbc486d0
This function reads the file in 1024-byte chunks and updates the SHA-1 hash for each chunk, returning the final hash as a hexadecimal string.
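A quick way to convince yourself that chunking doesn't change the result: hashing a payload in 1024-byte pieces yields exactly the same digest as hashing it in one call, because update() just feeds more bytes into the same running hash. This small check uses an in-memory byte string rather than a file:

```python
import hashlib

data = b'x' * 5000  # larger than a single 1024-byte chunk

# Hash in 1024-byte chunks, as hashfile() does
hasher = hashlib.sha1()
for i in range(0, len(data), 1024):
    hasher.update(data[i:i + 1024])
chunked = hasher.hexdigest()

# Hash the entire payload in one call
whole = hashlib.sha1(data).hexdigest()

print(chunked == whole)  # True: chunking does not change the digest
```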
Finding Duplicates
Now let's create the main function to find duplicate files in a directory:
import os
import hashlib

def hashfile(filename):
    with open(filename, 'rb') as f:
        hasher = hashlib.sha1()
        while True:
            data = f.read(1024)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

def find_duplicates(dirname):
    files = os.listdir(dirname)
    if len(files) < 2:
        print("Not enough files to check for duplicates")
        return
    hashes = {}
    duplicates_found = False
    for filename in files:
        path = os.path.join(dirname, filename)
        if not os.path.isfile(path):
            continue
        file_hash = hashfile(path)
        if file_hash not in hashes:
            hashes[file_hash] = path
        else:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
            duplicates_found = True
    if not duplicates_found:
        print("No duplicates found")

# Create some test files
with open('file1.txt', 'w') as f:
    f.write('Same content')
with open('file2.txt', 'w') as f:
    f.write('Same content')
with open('file3.txt', 'w') as f:
    f.write('Different content')

# Find duplicates in current directory
find_duplicates('.')
Duplicate found: ./file2.txt and ./file1.txt
Complete Script
Here's the complete duplicate file finder with error handling:
import os
import hashlib

def hashfile(filename):
    """Generate SHA-1 hash for a file"""
    try:
        with open(filename, 'rb') as f:
            hasher = hashlib.sha1()
            while True:
                data = f.read(1024)
                if not data:
                    break
                hasher.update(data)
        return hasher.hexdigest()
    except IOError:
        print(f"Error reading file: {filename}")
        return None

def find_duplicates(dirname):
    """Find duplicate files in a directory"""
    try:
        files = os.listdir(dirname)
    except OSError:
        print(f"Error accessing directory: {dirname}")
        return
    if len(files) < 2:
        print("Not enough files to check for duplicates")
        return
    hashes = {}
    duplicates_found = False
    for filename in files:
        path = os.path.join(dirname, filename)
        # Skip directories and non-readable files
        if not os.path.isfile(path):
            continue
        file_hash = hashfile(path)
        if file_hash is None:
            continue
        if file_hash not in hashes:
            hashes[file_hash] = path
        else:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
            duplicates_found = True
    if not duplicates_found:
        print("No duplicates found")

if __name__ == '__main__':
    directory = input("Enter directory path (or '.' for current): ").strip()
    if not directory:
        directory = '.'
    find_duplicates(directory)
Key Features
SHA-1 Hashing: Creates a fingerprint for each file, so files with identical content always produce identical hashes
Memory Efficient: Reads files in small chunks
Error Handling: Handles file access errors gracefully
Directory Filtering: Only processes actual files, not directories
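One possible extension, not part of the script above: because the script uses os.listdir(), it only scans a single directory level. A recursive variant could use os.walk() with the same chunked-hashing idea. The sketch below is one way to do it; the function name find_duplicates_recursive is illustrative:

```python
import os
import hashlib

def hashfile(filename):
    # Same chunked SHA-1 hashing as in the article
    hasher = hashlib.sha1()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(1024), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates_recursive(root):
    """Walk root and all subdirectories, reporting duplicate files."""
    hashes = {}  # hash -> first path seen with that hash
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                digest = hashfile(path)
            except OSError:
                continue  # skip unreadable files
            if digest in hashes:
                print(f'Duplicate found: {path} and {hashes[digest]}')
            else:
                hashes[digest] = path

if __name__ == '__main__':
    find_duplicates_recursive('.')
```

os.walk() already filters out subdirectory names from filenames, so the os.path.isfile() check from the original script is no longer needed.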
Conclusion
This Python script efficiently identifies duplicate files by comparing SHA-1 hash values. The approach is memory-efficient and can handle large files by processing them in chunks rather than loading entire files into memory.
