Finding Duplicate Files with Python
Duplicate files consume unnecessary storage space on our computers. Finding and removing them manually can be time-consuming, but we can automate this process with Python using file hashing techniques.
Required Modules
We'll use Python's built-in os and hashlib modules, so no additional packages need to be installed:

import os
import hashlib
How It Works
The algorithm works by:
Creating a hashfile() function that generates a unique SHA-1 hash for each file
Storing hash values in a dictionary as we scan through files
When the same hash appears twice, we've found duplicate files
Files with identical hashes have identical content
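The steps above can be sketched in a few lines before we deal with real files. This toy version hashes in-memory byte strings instead of file contents; the names and contents here are purely illustrative:

```python
import hashlib

# Illustrative stand-ins for file contents (not real files on disk)
fake_files = {
    'a.txt': b'hello',
    'b.txt': b'world',
    'c.txt': b'hello',   # same content as a.txt
}

seen = {}  # maps hash -> first filename seen with that hash
for name, content in fake_files.items():
    digest = hashlib.sha1(content).hexdigest()
    if digest in seen:
        # Second time we see this hash: the contents are identical
        print(f'{name} duplicates {seen[digest]}')
    else:
        seen[digest] = name
# prints: c.txt duplicates a.txt
```

The real script below follows the same pattern, with file paths as dictionary values and file contents fed to the hash in chunks.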
Creating the Hash Function
First, let's create a function to generate a hash value for any file:
import os
import hashlib

def hashfile(filename):
    with open(filename, 'rb') as f:
        hasher = hashlib.sha1()
        while True:
            data = f.read(1024)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

# Test with a sample file (create one first)
with open('test.txt', 'w') as f:
    f.write('Hello World')

print(hashfile('test.txt'))
0a4d55a8d778e5022fab701977c5d840bbc486d0
This function reads the file in 1024-byte chunks and updates the SHA-1 hash for each chunk, returning the final hash as a hexadecimal string.
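A quick way to convince yourself that chunking doesn't change the result: hashing a payload in 1024-byte pieces yields exactly the same digest as hashing it in one call, because update() just feeds more bytes into the same running hash. This small check uses an in-memory byte string rather than a file:

```python
import hashlib

data = b'x' * 5000  # larger than a single 1024-byte chunk

# Hash in 1024-byte chunks, as hashfile() does
hasher = hashlib.sha1()
for i in range(0, len(data), 1024):
    hasher.update(data[i:i + 1024])
chunked = hasher.hexdigest()

# Hash the entire payload in one call
whole = hashlib.sha1(data).hexdigest()

print(chunked == whole)  # True: chunking does not change the digest
```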
Finding Duplicates
Now let's create the main function to find duplicate files in a directory:
import os
import hashlib

def hashfile(filename):
    with open(filename, 'rb') as f:
        hasher = hashlib.sha1()
        while True:
            data = f.read(1024)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

def find_duplicates(dirname):
    files = os.listdir(dirname)
    if len(files) < 2:
        print("Not enough files to check for duplicates")
        return
    hashes = {}
    duplicates_found = False
    for filename in files:
        path = os.path.join(dirname, filename)
        if not os.path.isfile(path):
            continue
        file_hash = hashfile(path)
        if file_hash not in hashes:
            hashes[file_hash] = path
        else:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
            duplicates_found = True
    if not duplicates_found:
        print("No duplicates found")

# Create some test files
with open('file1.txt', 'w') as f:
    f.write('Same content')
with open('file2.txt', 'w') as f:
    f.write('Same content')
with open('file3.txt', 'w') as f:
    f.write('Different content')

# Find duplicates in current directory
find_duplicates('.')
Duplicate found: ./file2.txt and ./file1.txt
Complete Script
Here's the complete duplicate file finder with error handling:
import os
import hashlib

def hashfile(filename):
    """Generate SHA-1 hash for a file"""
    try:
        with open(filename, 'rb') as f:
            hasher = hashlib.sha1()
            while True:
                data = f.read(1024)
                if not data:
                    break
                hasher.update(data)
        return hasher.hexdigest()
    except IOError:
        print(f"Error reading file: {filename}")
        return None

def find_duplicates(dirname):
    """Find duplicate files in a directory"""
    try:
        files = os.listdir(dirname)
    except OSError:
        print(f"Error accessing directory: {dirname}")
        return
    if len(files) < 2:
        print("Not enough files to check for duplicates")
        return
    hashes = {}
    duplicates_found = False
    for filename in files:
        path = os.path.join(dirname, filename)
        # Skip directories and non-readable files
        if not os.path.isfile(path):
            continue
        file_hash = hashfile(path)
        if file_hash is None:
            continue
        if file_hash not in hashes:
            hashes[file_hash] = path
        else:
            print(f'Duplicate found: {path} and {hashes[file_hash]}')
            duplicates_found = True
    if not duplicates_found:
        print("No duplicates found")

if __name__ == '__main__':
    directory = input("Enter directory path (or '.' for current): ").strip()
    if not directory:
        directory = '.'
    find_duplicates(directory)
Key Features
SHA-1 Hashing: Creates a fingerprint for each file, so files with identical content always produce identical hashes
Memory Efficient: Reads files in small chunks
Error Handling: Handles file access errors gracefully
Directory Filtering: Only processes actual files, not directories
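One possible extension, not part of the script above: because the script uses os.listdir(), it only scans a single directory level. A recursive variant could use os.walk() with the same chunked-hashing idea. The sketch below is one way to do it; the function name find_duplicates_recursive is illustrative:

```python
import os
import hashlib

def hashfile(filename):
    # Same chunked SHA-1 hashing as in the article
    hasher = hashlib.sha1()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(1024), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates_recursive(root):
    """Walk root and all subdirectories, reporting duplicate files."""
    hashes = {}  # hash -> first path seen with that hash
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                digest = hashfile(path)
            except OSError:
                continue  # skip unreadable files
            if digest in hashes:
                print(f'Duplicate found: {path} and {hashes[digest]}')
            else:
                hashes[digest] = path

if __name__ == '__main__':
    find_duplicates_recursive('.')
```

os.walk() already filters out subdirectory names from filenames, so the os.path.isfile() check from the original script is no longer needed.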
Conclusion
This Python script efficiently identifies duplicate files by comparing SHA-1 hash values. The approach is memory-efficient and can handle large files by processing them in chunks rather than loading entire files into memory.
