Check if two PDF documents are identical with Python

PDF files are widely used for sharing documents, and it's often essential to check if two PDF files are identical. Python offers several libraries and methods to achieve this comparison. In this article, we'll explore various approaches to determine if two PDF documents contain the same content.

Using PyPDF2 for Text Comparison

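PyPDF2 is a popular Python library for PDF manipulation. It can extract text, images, and metadata from PDF files. The PyPDF2 name is no longer actively developed; the project continues as pypdf, which provides the same PdfReader interface used here. We can compare PDFs by extracting their text content and comparing it page by page. This checks only the text, so differences in images, fonts, or layout go unnoticed.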

Example

The following code demonstrates how to compare two PDF files using PyPDF2:

import PyPDF2

def compare_pdfs_pypdf2(file1, file2):
    try:
        with open(file1, "rb") as f1, open(file2, "rb") as f2:
            pdf1 = PyPDF2.PdfReader(f1)
            pdf2 = PyPDF2.PdfReader(f2)
            
            # Compare number of pages
            if len(pdf1.pages) != len(pdf2.pages):
                return False
            
            # Compare each page content
            for i in range(len(pdf1.pages)):
                page1_text = pdf1.pages[i].extract_text()
                page2_text = pdf2.pages[i].extract_text()
                
                if page1_text != page2_text:
                    return False
            
            return True
    except Exception as e:
        print(f"Error reading PDF files: {e}")
        return False

# Example usage
if __name__ == '__main__':
    file1 = "document1.pdf"
    file2 = "document2.pdf"
    
    if compare_pdfs_pypdf2(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")

This function first checks if both PDFs have the same number of pages. If not, they're immediately considered different. Then it extracts text from each corresponding page and compares them. The function includes error handling for cases where PDF files cannot be read.
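Extracted text can differ in whitespace or line breaks even when two pages read the same, depending on how each PDF was produced. The following is a minimal sketch of a normalization step that could be applied to each page's text before the comparison. The helper names normalize_text and pages_match are hypothetical and not part of PyPDF2, and treating a missing value as empty text is an assumption made for robustness.

```python
import re
import unicodedata


def normalize_text(text):
    """Return text with Unicode normalized and whitespace runs collapsed."""
    # Assumption: treat a missing value (None) as empty text
    text = unicodedata.normalize("NFKC", text or "")
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space
    return re.sub(r"\s+", " ", text).strip()


def pages_match(text1, text2):
    """Compare two extracted page strings after normalization."""
    return normalize_text(text1) == normalize_text(text2)


print(pages_match("Hello   world\n", "Hello world"))  # True
print(pages_match("Hello world", "Hello there"))      # False
```

In the function above, pages_match(page1_text, page2_text) could stand in for the direct != comparison if whitespace differences should not count as changes.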

Using File Hash Comparison

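Comparing hash values is the most reliable way to check whether two files are byte-for-byte identical, because any accidental difference in the binary content changes the hash. Keep in mind that two PDFs with the same visible content can still differ at the byte level, for example in embedded timestamps or metadata. Matching hashes show that the files are the same, but differing hashes do not prove that the visible content differs.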

Example

Here's how to compare PDFs using hash values:

import hashlib

def compare_pdfs_hash(file1, file2):
    """Compare two PDF files using MD5 hash"""
    def get_file_hash(filename):
        hash_md5 = hashlib.md5()
        try:
            with open(filename, "rb") as f:
                # Read file in chunks to handle large files efficiently
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            return hash_md5.hexdigest()
        except FileNotFoundError:
            print(f"File {filename} not found")
            return None
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            return None
    
    hash1 = get_file_hash(file1)
    hash2 = get_file_hash(file2)
    
    if hash1 is None or hash2 is None:
        return False
    
    return hash1 == hash2

# Test with sample comparison
# Note: This creates dummy files for demonstration
import tempfile

# Create two identical temporary files
content = b"Sample PDF content for testing"
with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as f1:
    f1.write(content)
    file1_path = f1.name

with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as f2:
    f2.write(content)
    file2_path = f2.name

result = compare_pdfs_hash(file1_path, file2_path)
print(f"Files are identical: {result}")

# Cleanup
import os
os.unlink(file1_path)
os.unlink(file2_path)

Output

Files are identical: True
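MD5 is adequate for spotting accidental differences, but collisions can be crafted deliberately. As an alternative sketch, the standard library's filecmp module can compare the bytes directly, and SHA-256 can replace MD5 when a digest is needed. The function names sha256_of and files_identical are illustrative, and the sample files are throwaway data rather than real PDFs.

```python
import filecmp
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path1, path2):
    """Byte-for-byte comparison of two files."""
    # Files of different sizes cannot be identical, so skip reading them
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    # shallow=False forces filecmp to compare the actual contents
    return filecmp.cmp(path1, path2, shallow=False)


# Demonstration with throwaway files (not real PDFs)
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.pdf")
    b = os.path.join(tmp, "b.pdf")
    c = os.path.join(tmp, "c.pdf")
    for path, data in ((a, b"same bytes"), (b, b"same bytes"), (c, b"other bytes")):
        with open(path, "wb") as f:
            f.write(data)

    print(files_identical(a, b))         # True
    print(files_identical(a, c))         # False
    print(sha256_of(a) == sha256_of(b))  # True
```

A direct comparison avoids hashing altogether when only two files are involved. Digests are more useful when the same file will be checked against many others, since each digest can be computed once and stored.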

Using difflib for Detailed Text Comparison

The difflib library provides tools for comparing sequences and can show detailed differences between PDF texts. This method is useful when you need to understand what differs between documents.

Example

This example uses difflib to detect and analyze differences:

import difflib
import PyPDF2

def compare_pdfs_difflib(file1, file2):
    """Compare PDFs using difflib for detailed analysis"""
    try:
        with open(file1, "rb") as f1, open(file2, "rb") as f2:
            pdf1 = PyPDF2.PdfReader(f1)
            pdf2 = PyPDF2.PdfReader(f2)
            
            if len(pdf1.pages) != len(pdf2.pages):
                print(f"Different number of pages: {len(pdf1.pages)} vs {len(pdf2.pages)}")
                return False
            
            for i in range(len(pdf1.pages)):
                text1 = pdf1.pages[i].extract_text().splitlines()
                text2 = pdf2.pages[i].extract_text().splitlines()
                
                # Generate differences
                diff = list(difflib.ndiff(text1, text2))
                
                # Check for any additions or deletions
                has_differences = any(line.startswith('+ ') or line.startswith('- ') 
                                    for line in diff)
                
                if has_differences:
                    print(f"Differences found on page {i + 1}")
                    return False
            
            return True
            
    except Exception as e:
        print(f"Error comparing PDFs: {e}")
        return False

# Example usage would require actual PDF files
print("Function defined successfully - requires actual PDF files to test")
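To report what changed rather than only that something changed, difflib.unified_diff can produce a patch-style listing. The sketch below works on lists of lines, such as those returned by extract_text().splitlines(). The function name describe_differences and the sample page contents are made up for illustration.

```python
import difflib


def describe_differences(lines1, lines2, label1="file1", label2="file2"):
    """Return unified-diff lines describing how lines2 differs from lines1."""
    return list(
        difflib.unified_diff(
            lines1, lines2, fromfile=label1, tofile=label2, lineterm=""
        )
    )


# Made-up page contents standing in for extracted PDF text
page_a = ["Invoice 1001", "Total: 250.00", "Thank you"]
page_b = ["Invoice 1001", "Total: 275.00", "Thank you"]

for line in describe_differences(page_a, page_b):
    print(line)
```

Removed lines are prefixed with - and added lines with +. An empty result means the two pages produced the same lines.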

Comparison of Methods

Method             Accuracy          Speed       Use Case
Hash Comparison    100% (binary)     Very Fast   Exact file matching
PyPDF2 Text        Good (content)    Medium      Content comparison
difflib Analysis   Good (detailed)   Slower      Finding specific differences

Key Considerations

When choosing a comparison method, consider these factors:

  • File Size: Hash comparison is most efficient for large files

  • Precision Needed: Hash comparison detects any binary difference, while text methods focus on content

  • Error Handling: Always wrap file operations in try-except blocks

  • Metadata: Text-based methods ignore formatting and metadata differences
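These considerations suggest a tiered check: try the cheap byte-level comparison first, and fall back to text comparison only when the bytes differ. The following is a sketch under stated assumptions. The function classify_pair and its extract_pages parameter are hypothetical. In practice extract_pages might wrap PyPDF2's PdfReader, but here a stand-in that reads plain text files is used so the example runs without real PDFs.

```python
import filecmp
import os
import tempfile


def classify_pair(path1, path2, extract_pages):
    """Return 'identical', 'same text', or 'different' for two files.

    extract_pages is a caller-supplied function (hypothetical here) that
    takes a path and returns a list with one text string per page.
    """
    # Step 1: exact byte comparison, the cheapest and strictest test
    if filecmp.cmp(path1, path2, shallow=False):
        return "identical"
    # Step 2: bytes differ, so compare the extracted text page by page
    if extract_pages(path1) == extract_pages(path2):
        return "same text"
    return "different"


def fake_extract_pages(path):
    """Stand-in extractor: treats form feeds as page breaks in a text file."""
    with open(path, "r", encoding="utf-8") as f:
        return [page.strip() for page in f.read().split("\f")]


with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a.txt": "Page one\fPage two",
        "b.txt": "Page one\fPage two",
        "c.txt": "Page one  \f  Page two",  # same text, different bytes
        "d.txt": "Page one\fPage 2",
    }
    paths = {}
    for name, body in samples.items():
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "w", encoding="utf-8") as f:
            f.write(body)

    print(classify_pair(paths["a.txt"], paths["b.txt"], fake_extract_pages))  # identical
    print(classify_pair(paths["a.txt"], paths["c.txt"], fake_extract_pages))  # same text
    print(classify_pair(paths["a.txt"], paths["d.txt"], fake_extract_pages))  # different
```

Returning three distinct outcomes, rather than a single boolean, keeps "byte-identical" separate from "same text but different bytes", which is the case where metadata or formatting changed.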

Conclusion

Use hash comparison for exact binary matching of PDF files, as it's the fastest and most reliable method. Remember that it reports a difference whenever the bytes differ, even if the two documents look the same. For content-based comparison that ignores formatting differences, PyPDF2 text extraction works well. Choose difflib when you need detailed analysis of what differs between documents.

Updated on: 2026-03-27T16:41:34+05:30
