Check if two PDF documents are identical with Python
PDF files are widely used for sharing documents, and it's often essential to check if two PDF files are identical. Python offers several libraries and methods to achieve this comparison. In this article, we'll explore various approaches to determine if two PDF documents contain the same content.
Using PyPDF2 for Text Comparison
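PyPDF2 is a popular Python library for PDF manipulation. It can extract text, images, and metadata from PDF files. Note that PyPDF2 is no longer actively developed; its successor is published as pypdf and exposes a very similar PdfReader interface. We can compare PDFs by extracting and comparing their text content page by page.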
Example
The following code demonstrates how to compare two PDF files using PyPDF2:
import PyPDF2

def compare_pdfs_pypdf2(file1, file2):
    try:
        with open(file1, "rb") as f1, open(file2, "rb") as f2:
            pdf1 = PyPDF2.PdfReader(f1)
            pdf2 = PyPDF2.PdfReader(f2)
            # Compare number of pages
            if len(pdf1.pages) != len(pdf2.pages):
                return False
            # Compare each page content
            for i in range(len(pdf1.pages)):
                page1_text = pdf1.pages[i].extract_text()
                page2_text = pdf2.pages[i].extract_text()
                if page1_text != page2_text:
                    return False
            return True
    except Exception as e:
        print(f"Error reading PDF files: {e}")
        return False

# Example usage
if __name__ == '__main__':
    file1 = "document1.pdf"
    file2 = "document2.pdf"
    if compare_pdfs_pypdf2(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")
This function first checks if both PDFs have the same number of pages. If not, they're immediately considered different. Then it extracts text from each corresponding page and compares them. The function includes error handling for cases where PDF files cannot be read.
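Text extraction can return slightly different whitespace for pages that look the same, so a strict comparison may report differences that a reader would not notice. The sketch below shows one possible way to soften this. It assumes the page texts have already been extracted into lists of strings; `normalize_text` and `pages_match` are hypothetical helper names, not part of PyPDF2.

```python
import re

def normalize_text(text):
    """Collapse runs of whitespace to one space and trim the ends (hypothetical helper)."""
    return re.sub(r"\s+", " ", text or "").strip()

def pages_match(pages1, pages2):
    """Compare two lists of page texts after whitespace normalization."""
    if len(pages1) != len(pages2):
        return False
    return all(normalize_text(a) == normalize_text(b)
               for a, b in zip(pages1, pages2))

# Same words, different spacing: treated as equal
print(pages_match(["Hello   world\n"], ["Hello world"]))  # True
# Different words: treated as different
print(pages_match(["Hello world"], ["Hello there"]))      # False
```

Whether whitespace should be ignored depends on the use case; for legal or layout-sensitive documents the strict comparison may be the better choice.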
Using File Hash Comparison
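To check whether two files are byte-for-byte identical, compare their hash values. Any change in the binary content produces a different hash. Keep in mind that this is a strict test: two PDFs that look the same when opened can still differ in bytes (for example, because of a different creation timestamp or metadata), and they will then be reported as different.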
Example
Here's how to compare PDFs using hash values:
import hashlib
import os
import tempfile

def compare_pdfs_hash(file1, file2):
    """Compare two PDF files using MD5 hash"""
    def get_file_hash(filename):
        hash_md5 = hashlib.md5()
        try:
            with open(filename, "rb") as f:
                # Read file in chunks to handle large files efficiently
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            return hash_md5.hexdigest()
        except FileNotFoundError:
            print(f"File {filename} not found")
            return None
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            return None

    hash1 = get_file_hash(file1)
    hash2 = get_file_hash(file2)
    if hash1 is None or hash2 is None:
        return False
    return hash1 == hash2

# Test with sample comparison
# Note: This creates dummy files (not real PDFs) for demonstration
# Create two identical temporary files
content = b"Sample PDF content for testing"
with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as f1:
    f1.write(content)
    file1_path = f1.name
with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as f2:
    f2.write(content)
    file2_path = f2.name

result = compare_pdfs_hash(file1_path, file2_path)
print(f"Files are identical: {result}")

# Cleanup
os.unlink(file1_path)
os.unlink(file2_path)
Output
Files are identical: True
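MD5 is adequate for spotting accidental differences, but it is not collision-resistant, so it should not be relied on when files may have been crafted deliberately. As an alternative sketch using only the standard library, the code below compares files directly with `filecmp.cmp` and also shows a SHA-256 digest. The helper names `sha256_of_file` and `files_byte_identical` are hypothetical, and the temporary files are dummy data rather than real PDFs.

```python
import filecmp
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_byte_identical(path1, path2):
    """Compare file contents directly instead of going through a hash."""
    # shallow=False forces a content comparison rather than trusting
    # the os.stat() signature (size and modification time) alone
    return filecmp.cmp(path1, path2, shallow=False)

# Demonstration with dummy files (assumed content, not real PDFs)
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.pdf")
    b = os.path.join(tmp, "b.pdf")
    c = os.path.join(tmp, "c.pdf")
    for path, data in ((a, b"same bytes"), (b, b"same bytes"), (c, b"other bytes")):
        with open(path, "wb") as f:
            f.write(data)

    print(files_byte_identical(a, b))                # True
    print(files_byte_identical(a, c))                # False
    print(sha256_of_file(a) == sha256_of_file(b))    # True
```

A direct comparison avoids hashing altogether when both files are on hand; storing a digest is more useful when one file has to be checked later against a previously recorded value.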
Using difflib for Detailed Text Comparison
The difflib library provides tools for comparing sequences and can show detailed differences between PDF texts. This method is useful when you need to understand what differs between documents.
Example
This example uses difflib to detect and analyze differences:
import difflib
import PyPDF2

def compare_pdfs_difflib(file1, file2):
    """Compare PDFs using difflib for detailed analysis"""
    try:
        with open(file1, "rb") as f1, open(file2, "rb") as f2:
            pdf1 = PyPDF2.PdfReader(f1)
            pdf2 = PyPDF2.PdfReader(f2)
            if len(pdf1.pages) != len(pdf2.pages):
                print(f"Different number of pages: {len(pdf1.pages)} vs {len(pdf2.pages)}")
                return False
            for i in range(len(pdf1.pages)):
                text1 = pdf1.pages[i].extract_text().splitlines()
                text2 = pdf2.pages[i].extract_text().splitlines()
                # Generate differences
                diff = list(difflib.ndiff(text1, text2))
                # Check for any additions or deletions
                has_differences = any(line.startswith(('+ ', '- '))
                                      for line in diff)
                if has_differences:
                    print(f"Differences found on page {i + 1}")
                    return False
            return True
    except Exception as e:
        print(f"Error comparing PDFs: {e}")
        return False

# Example usage would require actual PDF files
print("Function defined successfully - requires actual PDF files to test")
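The function above only reports that a page differs. To see which lines changed, the diff itself can be printed. The following is a minimal sketch that works on plain strings, so it runs without any PDF files; `describe_text_differences` is a hypothetical helper name and the two sample page texts are made up for illustration.

```python
import difflib

def describe_text_differences(text1, text2, label1="first", label2="second"):
    """Return unified-diff lines for two page texts (empty list if they match)."""
    return list(difflib.unified_diff(
        text1.splitlines(),
        text2.splitlines(),
        fromfile=label1,
        tofile=label2,
        lineterm="",  # input lines have no trailing newline, so add none to headers
    ))

# Made-up page texts standing in for extract_text() output
page_a = "Invoice 1001\nTotal: 50 USD"
page_b = "Invoice 1001\nTotal: 55 USD"

for line in describe_text_differences(page_a, page_b):
    print(line)
```

Lines starting with `-` appear only in the first text and lines starting with `+` only in the second, which makes it easier to report the exact change on each page.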
Comparison of Methods
| Method | Accuracy | Speed | Use Case |
|---|---|---|---|
| Hash Comparison | 100% (binary) | Very Fast | Exact file matching |
| PyPDF2 Text | Good (content) | Medium | Content comparison |
| difflib Analysis | Good (detailed) | Slower | Finding specific differences |
Key Considerations
When choosing a comparison method, consider these factors:
- File Size: Hash comparison is most efficient for large files
- Precision Needed: Hash comparison detects any binary difference, while text methods focus on content
- Error Handling: Always wrap file operations in try-except blocks
- Metadata: Text-based methods ignore formatting and metadata differences
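These considerations can be combined into a tiered check: try the cheap byte-level comparison first and fall back to text comparison only when the bytes differ. The sketch below is one possible arrangement, not a definitive implementation. The `extract_pages` parameter is an assumption of this sketch: a caller-supplied function that takes a path and returns a list of page texts (for real PDFs it could wrap a PDF library). The demonstration uses a stand-in extractor on plain text files, where lines starting with `#` play the role of metadata.

```python
import filecmp
import os
import tempfile

def compare_documents(path1, path2, extract_pages):
    """Return 'identical', 'same text', or 'different'.

    extract_pages is a caller-supplied function (an assumption of this
    sketch) that maps a file path to a list of page-text strings.
    """
    # Tier 1: byte-level check, cheap and exact
    if filecmp.cmp(path1, path2, shallow=False):
        return "identical"
    # Tier 2: content check, ignores whatever the extractor ignores
    if extract_pages(path1) == extract_pages(path2):
        return "same text"
    return "different"

def fake_extract_pages(path):
    """Stand-in extractor for the demo: skips '#' lines as if they were metadata."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if not line.startswith("#")]

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("# created: Monday\nHello\n")
    with open(second, "w", encoding="utf-8") as f:
        f.write("# created: Tuesday\nHello\n")

    print(compare_documents(first, first, fake_extract_pages))   # identical
    print(compare_documents(first, second, fake_extract_pages))  # same text
```

Returning three distinct outcomes keeps "byte-identical" and "same text, different bytes" apart, which a single True/False result cannot express.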
Conclusion
Use hash comparison for exact binary matching of PDF files: it is the fastest method and catches any byte-level change, though it will also flag files that differ only in metadata. For content-based comparison that ignores formatting differences, PyPDF2 text extraction works well. Choose difflib when you need detailed analysis of what differs between documents.
