Check if two PDF documents are identical with Python

PDF files are widely used for sharing documents, and you often need to check whether two of them are identical. "Identical" can mean two things: the files match byte for byte, or they contain the same text even though the files themselves differ. Python offers libraries for both kinds of comparison. In this article, we'll explore several approaches to determine whether two PDF documents contain the same content.

Using PyPDF2 for Text Comparison

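PyPDF2 is a popular Python library for PDF manipulation. It can extract text, images, and metadata from PDF files. (PyPDF2 itself is deprecated; its development continues under the name pypdf, which provides the same PdfReader interface used here.) We can compare PDFs by extracting and comparing their text content page by page.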

Example

The following code demonstrates how to compare two PDF files using PyPDF2:

import PyPDF2

def compare_pdfs_pypdf2(file1, file2):
    try:
        with open(file1, "rb") as f1, open(file2, "rb") as f2:
            pdf1 = PyPDF2.PdfReader(f1)
            pdf2 = PyPDF2.PdfReader(f2)
            
            # Compare number of pages
            if len(pdf1.pages) != len(pdf2.pages):
                return False
            
            # Compare each page content
            for i in range(len(pdf1.pages)):
                page1_text = pdf1.pages[i].extract_text()
                page2_text = pdf2.pages[i].extract_text()
                
                if page1_text != page2_text:
                    return False
            
            return True
    except Exception as e:
        print(f"Error reading PDF files: {e}")
        return False

# Example usage
if __name__ == '__main__':
    file1 = "document1.pdf"
    file2 = "document2.pdf"
    
    if compare_pdfs_pypdf2(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")

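This function first checks whether both PDFs have the same number of pages. If not, they're immediately considered different. It then extracts the text from each pair of corresponding pages and compares it. Because only extracted text is compared, differences in images, fonts, or layout go undetected. If either file cannot be read, the function prints the error and returns False, so an unreadable file is reported as "not identical".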
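Text extracted from two visually identical pages can differ in spacing or line breaks, depending on how each PDF was produced, so a strict string comparison may report false differences. The following is a minimal sketch of one way to soften that, assuming whitespace layout is not significant for your comparison. The helper names normalize_text and texts_match are hypothetical and not part of PyPDF2.

```python
import re

def normalize_text(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # Treat None defensively as an empty string (assumption: a page with
    # no extractable text should compare equal to another empty page).
    return re.sub(r"\s+", " ", text or "").strip()

def texts_match(text1, text2):
    """Return True if two extracted texts are equal once whitespace is normalized."""
    return normalize_text(text1) == normalize_text(text2)

# Plain strings stand in for the result of page.extract_text()
print(texts_match("Hello   world\n", "Hello world"))
print(texts_match("Hello world", "Hello there"))
```

In the PyPDF2 function above, the comparison of page1_text and page2_text could be replaced with a call to texts_match if this looser notion of equality suits your use case.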

Using File Hash Comparison

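The most reliable way to check whether two files are completely identical is to compare their hash values. A hash is computed from the raw bytes of a file, so in practice any difference in the binary content produces a different hash. Keep in mind that two PDFs that look the same but were saved separately can differ in metadata such as timestamps, and their hashes will then not match.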

Example

Here's how to compare PDFs using hash values:

import hashlib

def compare_pdfs_hash(file1, file2):
    """Compare two PDF files using MD5 hash"""
    def get_file_hash(filename):
        hash_md5 = hashlib.md5()
        try:
            with open(filename, "rb") as f:
                # Read file in chunks to handle large files efficiently
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            return hash_md5.hexdigest()
        except FileNotFoundError:
            print(f"File {filename} not found")
            return None
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            return None
    
    hash1 = get_file_hash(file1)
    hash2 = get_file_hash(file2)
    
    if hash1 is None or hash2 is None:
        return False
    
    return hash1 == hash2

# Test with sample comparison
# Note: This creates dummy files for demonstration
import tempfile

# Create two identical temporary files
content = b"Sample PDF content for testing"
with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as f1:
    f1.write(content)
    file1_path = f1.name

with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as f2:
    f2.write(content)
    file2_path = f2.name

result = compare_pdfs_hash(file1_path, file2_path)
print(f"Files are identical: {result}")

# Cleanup
import os
os.unlink(file1_path)
os.unlink(file2_path)

Output

Files are identical: True
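If you only need a yes/no answer about byte-level equality, the standard library's filecmp module can compare file contents directly without computing a hash yourself. This is a sketch of that alternative, not the method used above; files_byte_identical and make_temp are hypothetical helper names, and the temporary files contain dummy bytes rather than real PDFs.

```python
import filecmp
import os
import tempfile

def files_byte_identical(file1, file2):
    """Return True if both files contain exactly the same bytes."""
    # shallow=False forces a comparison of file contents instead of
    # relying only on os.stat() signatures (type, size, modification time).
    return filecmp.cmp(file1, file2, shallow=False)

def make_temp(data):
    """Hypothetical demo helper: write data to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as f:
        f.write(data)
        return f.name

path_a = make_temp(b"same bytes")
path_b = make_temp(b"same bytes")
path_c = make_temp(b"other bytes")

same_result = files_byte_identical(path_a, path_b)
diff_result = files_byte_identical(path_a, path_c)
print(same_result, diff_result)

# Cleanup
for p in (path_a, path_b, path_c):
    os.unlink(p)
```

Hashing remains the better fit when you want to store a fingerprint and compare it later, or compare one file against many others.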

Using difflib for Detailed Text Comparison

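The difflib library provides tools for comparing sequences and can report which lines differ between two texts. This method is useful when you need to understand what differs between documents, not just whether they differ. The example below stops at the first page on which the extracted text differs.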

Example

This example uses difflib to detect and analyze differences:

import difflib
import PyPDF2

def compare_pdfs_difflib(file1, file2):
    """Compare PDFs using difflib for detailed analysis"""
    try:
        with open(file1, "rb") as f1, open(file2, "rb") as f2:
            pdf1 = PyPDF2.PdfReader(f1)
            pdf2 = PyPDF2.PdfReader(f2)
            
            if len(pdf1.pages) != len(pdf2.pages):
                print(f"Different number of pages: {len(pdf1.pages)} vs {len(pdf2.pages)}")
                return False
            
            for i in range(len(pdf1.pages)):
                text1 = pdf1.pages[i].extract_text().splitlines()
                text2 = pdf2.pages[i].extract_text().splitlines()
                
                # Generate differences
                diff = list(difflib.ndiff(text1, text2))
                
                # Check for any additions or deletions
                has_differences = any(line.startswith('+ ') or line.startswith('- ') 
                                    for line in diff)
                
                if has_differences:
                    print(f"Differences found on page {i + 1}")
                    return False
            
            return True
            
    except Exception as e:
        print(f"Error comparing PDFs: {e}")
        return False

# Example usage would require actual PDF files
print("Function defined successfully - requires actual PDF files to test")
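The function above reports only which page differs. To see the differing lines themselves, difflib.unified_diff can produce a compact report. The sketch below uses plain lists of strings in place of extracted page text, so it runs without any PDF files; describe_differences and the sample invoice lines are hypothetical.

```python
import difflib

def describe_differences(lines1, lines2, label1="document1", label2="document2"):
    """Return unified-diff lines describing how lines1 differs from lines2."""
    # lineterm="" because the input lines carry no trailing newlines
    return list(difflib.unified_diff(
        lines1, lines2, fromfile=label1, tofile=label2, lineterm=""
    ))

# Sample data standing in for page.extract_text().splitlines()
page1 = ["Invoice 1001", "Total: 50 USD"]
page2 = ["Invoice 1001", "Total: 55 USD"]

for line in describe_differences(page1, page2):
    print(line)
```

Lines prefixed with "-" appear only in the first text and lines prefixed with "+" appear only in the second. An empty result means the two texts are the same.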

Comparison of Methods

Method	Accuracy	Speed	Use Case
Hash Comparison	Exact (byte-level)	Very Fast	Exact file matching
PyPDF2 Text	Good (text content only)	Medium	Content comparison
difflib Analysis	Good (line-level detail)	Slower	Finding specific differences

Key Considerations

When choosing a comparison method, consider these factors:

File Size: Hash comparison is most efficient for large files
Precision Needed: Hash comparison detects any binary difference; text methods focus on content
Error Handling: Always wrap file operations in try/except blocks
Metadata: Text-based methods ignore formatting and metadata differences
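These considerations suggest a tiered check: run the cheap hash comparison first and fall back to text comparison only when the bytes differ. The sketch below is one possible arrangement, not a definitive implementation. It uses SHA-256 instead of the MD5 shown earlier, and classify_pair, sha256_of, demo_extract, and write_temp are hypothetical names. The extract_text argument is an assumption: a function you supply that takes a path and returns the document text. The demo uses small text files in which the first line plays the role of metadata, not real PDFs.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def classify_pair(file1, file2, extract_text):
    """Return 'identical', 'same text', or 'different' for two files.

    extract_text is a caller-supplied function (an assumption of this
    sketch) that takes a path and returns the document text as a string.
    """
    if sha256_of(file1) == sha256_of(file2):
        return "identical"      # byte-for-byte match, no parsing needed
    if extract_text(file1) == extract_text(file2):
        return "same text"      # bytes differ, extracted text matches
    return "different"

def demo_extract(path):
    # Stand-in extractor for the demo only: skips the first line,
    # which plays the role of metadata in these dummy files.
    with open(path, "rb") as f:
        return f.read().decode("utf-8").split("\n", 1)[1]

def write_temp(data):
    """Write data to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(data)
        return f.name

original = write_temp(b"meta: saved Monday\nQuarterly report")
copy = write_temp(b"meta: saved Monday\nQuarterly report")
resaved = write_temp(b"meta: saved Tuesday\nQuarterly report")
edited = write_temp(b"meta: saved Monday\nAnnual report")

results = [classify_pair(original, other, demo_extract)
           for other in (copy, resaved, edited)]
print(results)

# Cleanup
for path in (original, copy, resaved, edited):
    os.unlink(path)
```

With real PDFs, extract_text could be a small wrapper that joins the text of every page returned by a PDF reader.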

Conclusion

Use hash comparison for exact binary matching of PDF files; it's the fastest method and detects any byte-level change. For content-based comparison that ignores formatting and metadata differences, PyPDF2 text extraction works well. Choose difflib when you need a detailed analysis of what differs between documents.

Gaurav Leekha

Updated on: 2026-03-27T16:41:34+05:30
