Check if two PDF documents are identical with Python


PDF files are widely used for sharing documents, and it is often necessary to check whether two PDF files are identical. Python offers several libraries for this task, and it can also drive external command-line tools. In this article, we'll discuss several methods to check whether two PDF files are identical in Python.
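
In the strictest sense, "identical" means that the two files contain exactly the same bytes. The following is a minimal sketch of such a check using only the standard library. The function names sha256_of_file and files_are_byte_identical are illustrative and do not come from any PDF library. Two PDFs with the same visible content can still fail this check, for example when they were exported at different times, which is why the methods in this article compare content instead.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_are_byte_identical(file1, file2):
    """Return True only if both files contain exactly the same bytes."""
    return sha256_of_file(file1) == sha256_of_file(file2)
```

If this check returns True, no further comparison is needed. If it returns False, the files may still have the same content, and one of the methods below can be used.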

Method 1: Using PyPDF2

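PyPDF2 is a Python library that can manipulate PDF files. It provides several methods to extract data from a PDF file, including text, images, and metadata. PyPDF2 also supports merging, splitting, and encrypting PDF files. We can use PyPDF2 to compare two PDF files by iterating over their pages and comparing the text extracted from each page. The example uses the PdfReader interface of recent PyPDF2 releases; the project has since continued under the name pypdf, which offers the same PdfReader interface.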

Example

The following code snippet demonstrates how to use PyPDF2 to compare two PDF files −

import PyPDF2


def compare_pdfs(file1, file2):
    # Open both files in binary mode; the with block closes them afterwards
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        pdf1 = PyPDF2.PdfReader(f1)
        pdf2 = PyPDF2.PdfReader(f2)
        # PDFs with different page counts cannot be identical
        if len(pdf1.pages) != len(pdf2.pages):
            return False
        # Compare the extracted text page by page
        for page1, page2 in zip(pdf1.pages, pdf2.pages):
            if page1.extract_text() != page2.extract_text():
                return False
    return True


if __name__ == '__main__':
    file1 = "pdf1.pdf"
    file2 = "pdf2.pdf"
    if compare_pdfs(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")

In the above code, we define a compare_pdfs function that takes the paths of two PDF files as arguments. We use the PdfReader class of PyPDF2 to read the PDF files and compare the text of their pages using the extract_text method.

If the number of pages in the PDF files is not the same, we immediately return False, indicating that the PDFs are not identical. Otherwise, we compare the contents of each page in the two PDF files. If the contents of any two pages are not the same, we return False. Otherwise, we return True, indicating that the PDFs are identical.
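
Text extraction can introduce small whitespace differences, and a plain True or False hides where two documents diverge. As a minimal sketch, assuming the page texts have already been extracted into two lists of strings (for example with extract_text), the hypothetical helpers normalize and first_differing_page below collapse whitespace and report the first page that differs. Whether ignoring whitespace is acceptable depends on the use case.

```python
def normalize(text):
    """Collapse runs of whitespace so that spacing differences are ignored."""
    return " ".join(text.split())


def first_differing_page(pages1, pages2):
    """Return the 1-based number of the first differing page, or None.

    pages1 and pages2 are lists of page texts. If one document has more
    pages than the other, the first extra page counts as a difference.
    """
    for number, (text1, text2) in enumerate(zip(pages1, pages2), start=1):
        if normalize(text1) != normalize(text2):
            return number
    if len(pages1) != len(pages2):
        return min(len(pages1), len(pages2)) + 1
    return None
```

A result of None means that no difference was found in the extracted text; any other value is the page number to inspect first.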

Method 2: Using pdftotext

pdftotext is a command-line utility, distributed with the Poppler and Xpdf toolkits, that converts PDF files to plain text. We can use pdftotext to extract the text from two PDF files and compare the results using Python. This method requires pdftotext to be installed on the system.

Example

The following code snippet demonstrates how to use pdftotext to compare two PDF files −

import os
import subprocess


def compare_pdfs(file1, file2):
    temp_file1 = "temp_file1.txt"
    temp_file2 = "temp_file2.txt"
    try:
        # Convert each PDF to a plain text file
        subprocess.call(['pdftotext', file1, temp_file1])
        subprocess.call(['pdftotext', file2, temp_file2])
        # Read as bytes so the comparison does not depend on the text encoding
        with open(temp_file1, "rb") as f1, open(temp_file2, "rb") as f2:
            return f1.read() == f2.read()
    finally:
        # Remove the temporary files, even if an error occurred above
        for path in (temp_file1, temp_file2):
            if os.path.exists(path):
                os.remove(path)


if __name__ == '__main__':
    file1 = "pdf1.pdf"
    file2 = "pdf2.pdf"
    if compare_pdfs(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")

In the above code, we define a compare_pdfs function that takes the paths of two PDF files as arguments. We use the call() function of the subprocess module to run the pdftotext command-line utility. call() takes the command as a list: the first element is the program name, and the remaining elements are its arguments, here the input PDF file and the output text file.

The pdftotext command-line utility converts the PDF files to plain text files, and we store them in temporary files temp_file1.txt and temp_file2.txt. We then compare the contents of these temporary files using Python's open() function and the read() method. If the contents of the two files are the same, we return True, indicating that the PDFs are identical. Otherwise, we return False.

Finally, we remove the temporary files using the os.remove() function.
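
The temporary files can be avoided, because pdftotext writes to standard output when the output file name is given as a single dash (-). The following is a minimal sketch under that assumption. The helper names pdf_to_text and same_text and the executable parameter are illustrative, and the code assumes that pdftotext is available on the PATH.

```python
import shutil
import subprocess


def pdf_to_text(path, executable="pdftotext"):
    """Return the text of a PDF as bytes, as produced by pdftotext."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on the PATH")
    # "-" as the output file tells pdftotext to write to standard output
    result = subprocess.run(
        [executable, path, "-"],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return result.stdout


def same_text(file1, file2):
    """Compare the extracted text of two PDFs without temporary files."""
    return pdf_to_text(file1) == pdf_to_text(file2)
```

Because check=True is set, a failed conversion raises an exception instead of silently comparing two empty results.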

Method 3: Using difflib

difflib is a module in the Python standard library that provides a set of tools for comparing sequences. We can use difflib to compare the text extracted from two PDF files and determine the differences between them.

Example

The following code snippet demonstrates how to use difflib to compare two PDF files −

import difflib
import PyPDF2


def compare_pdfs(file1, file2):
    # Open both files in binary mode; the with block closes them afterwards
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        pdf1 = PyPDF2.PdfReader(f1)
        pdf2 = PyPDF2.PdfReader(f2)
        # PDFs with different page counts cannot be identical
        if len(pdf1.pages) != len(pdf2.pages):
            return False
        for page1, page2 in zip(pdf1.pages, pdf2.pages):
            text1 = page1.extract_text().splitlines()
            text2 = page2.extract_text().splitlines()
            # ndiff marks lines unique to one side with "- " or "+ "
            diff = difflib.ndiff(text1, text2)
            if any(line.startswith(("+ ", "- ")) for line in diff):
                return False
    return True


if __name__ == '__main__':
    file1 = "pdf1.pdf"
    file2 = "pdf2.pdf"
    if compare_pdfs(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")

In the above code, we define a compare_pdfs function that takes two PDF files as arguments. We use PyPDF2 to read the PDF files and extract the text from each page using the extract_text method. We split the extracted text into lines using the splitlines method.

We then use difflib to compare the lines of text extracted from the two PDF files. The ndiff function of difflib yields one output line per input line and prefixes lines that occur only in the first list with "- " and lines that occur only in the second list with "+ ". If any output line starts with "+" or "-", we know that the PDFs are not identical and return False.

If the differences list does not contain any lines that start with "+" or "-", we know that the PDFs are identical and return True.
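
When the PDFs differ, difflib can also show what changed rather than only that something changed. The sketch below assumes that the lines of one page from each document are already available as lists of strings. The function name page_diff and the labels page1 and page2 are illustrative.

```python
import difflib


def page_diff(lines1, lines2):
    """Return unified-diff lines for two lists of text lines.

    The result is an empty list when both pages have the same lines.
    """
    return list(
        difflib.unified_diff(
            lines1,
            lines2,
            fromfile="page1",  # label shown for the first document
            tofile="page2",    # label shown for the second document
            lineterm="",       # input lines carry no trailing newlines
        )
    )
```

Printing the returned lines one per row gives a compact report in which removed lines start with "-" and added lines start with "+".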

Method 4: Using pdftk

pdftk is a command-line utility that can manipulate PDF files. It has no built-in operation for comparing two documents, but its dump_data operation prints a report of a file's metadata, bookmarks, and page metrics. We can generate this report for both PDF files and compare the results using Python. This method requires pdftk to be installed on the system.

Example

The following code snippet demonstrates how to use pdftk to compare two PDF files −

import subprocess


def compare_pdfs(file1, file2):
    # dump_data prints the report to standard output when no output file is given
    data1 = subprocess.check_output(['pdftk', file1, 'dump_data'])
    data2 = subprocess.check_output(['pdftk', file2, 'dump_data'])
    return data1 == data2


if __name__ == '__main__':
    file1 = "pdf1.pdf"
    file2 = "pdf2.pdf"
    if compare_pdfs(file1, file2):
        print("PDFs are identical")
    else:
        print("PDFs are not identical")

In the above code, we define a compare_pdfs function that takes the paths of two PDF files as arguments. We use the subprocess module to call the pdftk command-line utility with the check_output() function. We pass the dump_data operation to pdftk once for each PDF file and capture the report that it prints.

If the two reports are the same, we return True. Otherwise, we return False. Note that the report covers document information, the number of pages, bookmarks, and page sizes, but not the text or images on the pages. Matching reports therefore show that the two files have the same metadata and page structure, and this check is best combined with one of the text-based methods above.

Conclusion

In this tutorial, we discussed various methods to check if two PDF files are identical in Python. We discussed four methods −

  • Using PyPDF2 to compare the text of two PDF files

  • Using pdftotext to compare the text of two PDF files

  • Using difflib to compare the text of two PDF files

  • Using pdftk to compare two PDF files

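These methods differ in what they actually compare, so the choice depends on the requirements of the project and the specific use case. The first three methods compare only the text extracted from the PDF files, so they cannot detect differences in images, fonts, or layout, and their result depends on how well text extraction works for the given files. PyPDF2 compares the files page by page and can stop at the first difference, which helps when the PDF files are large. Methods 1 and 3 need only the PyPDF2 package, while pdftotext and pdftk must be installed separately. If one of these tools is already installed on the system, it may be the most convenient option.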

Updated on: 22-Feb-2024
