How to find the difference between two files in Python?


When you process files, you often need to find the differences between two of them. Python's standard library provides several tools for this task. In this article, we look at a few approaches to comparing two files in Python. Each has its own strengths, so you can pick the one that suits the size and format of your files. We walk through each method step by step, and by the end of the article you should be able to compare files with confidence.

Understanding the Significance of File Comparison in Python

Before we look at the code examples, it helps to understand why file comparison matters. Comparing two files shows what has changed, what is the same, and what differs between them. It is an important part of version control, data validation, and detecting changes in configuration files.

Harnessing the filecmp Module

Our first approach uses the filecmp module, which offers a simple and efficient way to check whether two files are equal.

Example

import filecmp

file1 = "file1.txt"
file2 = "file2.txt"

# shallow=False forces a byte-by-byte comparison of the file contents
comparison = filecmp.cmp(file1, file2, shallow=False)
print("Files are the same:", comparison)

Output

For two sample files with different contents, the output is

Files are the same: False

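In this example, we import the filecmp module, which provides functions for comparing files and directories. The variables file1 and file2 hold the paths of file1.txt and file2.txt. The filecmp.cmp() function compares the two files and returns True if they are considered equal and False otherwise. Note that by default (shallow=True) the function first compares the os.stat() signatures of the files (type, size, and modification time) and treats files with matching signatures as equal without reading them. Passing shallow=False makes it always compare the actual contents. Finally, we print the result of the comparison to the console.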
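The effect of the shallow parameter can be illustrated with a small sketch. The helper write_temp, the file names, and the sample contents are assumptions made for this illustration only. The two files have the same size and are given the same modification time, so their os.stat() signatures match even though their contents differ.

```python
import filecmp
import os
import tempfile


def write_temp(directory, name, text):
    """Hypothetical helper: create a small text file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


with tempfile.TemporaryDirectory() as tmp:
    a = write_temp(tmp, "a.txt", "hello\n")
    b = write_temp(tmp, "b.txt", "jello\n")  # same size, different content

    # Give both files the same modification time so their signatures match.
    os.utime(a, (1_000_000_000, 1_000_000_000))
    os.utime(b, (1_000_000_000, 1_000_000_000))

    shallow_result = filecmp.cmp(a, b)              # compares signatures first
    deep_result = filecmp.cmp(a, b, shallow=False)  # always reads both files

print("shallow:", shallow_result)
print("deep:", deep_result)
```

With matching signatures, the shallow call should report the files as equal, while the deep call reads the contents and reports that they differ. When correctness matters more than speed, shallow=False is the safer choice.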

Embracing the difflib Module

Our second example uses the difflib module, which provides tools for comparing sequences, including the lines of text files.

Example

import difflib

file1 = "file1.txt"
file2 = "file2.txt"

with open(file1, "r") as f1, open(file2, "r") as f2:
    diff = difflib.unified_diff(
        f1.readlines(), f2.readlines(), fromfile=file1, tofile=file2
    )

# Each diff line already ends with a newline, so suppress print's own
for line in diff:
    print(line, end="")

Output

For two sample files that contain the single lines hello and hi, the output is

--- file1.txt
+++ file2.txt
@@ -1 +1 @@
-hello
+hi

Here, we import the difflib module, which contains functions for comparing sequences, including the lines of text files. We open both files with open(), which gives us the file objects f1 and f2, and pass the line lists returned by readlines() to difflib.unified_diff(). This function yields the differences in unified diff format: lines starting with - exist only in the first file, and lines starting with + exist only in the second. The fromfile and tofile parameters set the file names shown in the header of the output. Finally, we print each line of the diff to the console.
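The same function also works on lists of strings that are already in memory. The following is a minimal sketch with made-up sample data (old_lines and new_lines are assumptions for illustration). It uses two optional parameters of unified_diff(): lineterm, which sets the line ending added to header lines, and n, which sets how many unchanged context lines are shown around each change.

```python
import difflib

# Illustrative sample data; these lines carry no trailing newline.
old_lines = ["alpha", "beta", "gamma"]
new_lines = ["alpha", "BETA", "gamma", "delta"]

diff_lines = list(
    difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="old",
        tofile="new",
        lineterm="",  # the input lines have no newline, so add none to headers
        n=0,          # show no unchanged context lines around each change
    )
)

print("\n".join(diff_lines))
```

Setting n=0 keeps the output short when only the changed lines matter. The default of three context lines is usually easier to read for larger files.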

Mastering the hashlib Module

The third method compares files through the hashlib module, which provides hash functions such as SHA-1.

Example

import hashlib

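def file_hash(filename):
    sha1_hash = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in 8 KB chunks so large files are never loaded fully into memory
        # (the := operator requires Python 3.8 or later)
        while chunk := f.read(8192):
            sha1_hash.update(chunk)
    return sha1_hash.hexdigest()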

file1 = "path/to/first/file"
file2 = "path/to/second/file"

hash1 = file_hash(file1)
hash2 = file_hash(file2)

comparison = (hash1 == hash2)
print("Files are the same:", comparison)

Output

For two sample files with different contents, the output is

Files are the same: False

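In this example, the hashlib module gives us access to cryptographic hash functions. The file_hash() function accepts a file name and returns the SHA-1 hash of the file as a hexadecimal string. The file is read in binary mode, and the hash object is updated one chunk at a time, so even large files are never held in memory all at once. We then compare the hash values of the two files using the == operator and print the result to the console. Matching hashes indicate identical contents with very high probability, but this approach only tells you whether the files differ, not where. SHA-1 is also no longer considered collision-resistant, so hashlib.sha256() is a better choice if the files may have been crafted deliberately.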
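A possible variation on the hashing approach is sketched below. It swaps SHA-1 for SHA-256 and compares file sizes first, since files of different sizes cannot be identical and need not be hashed at all. The function names file_digest_hex and same_content, as well as the sample files, are assumptions for this illustration.

```python
import hashlib
import os
import tempfile


def file_digest_hex(path, algorithm="sha256", chunk_size=8192):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path1, path2):
    """Cheap size check first, then compare the SHA-256 digests."""
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    return file_digest_hex(path1) == file_digest_hex(path2)


# Small demonstration with temporary files (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, data in [("a", b"hello\n"), ("b", b"hello\n"), ("c", b"hi\n")]:
        paths[name] = os.path.join(tmp, name + ".bin")
        with open(paths[name], "wb") as f:
            f.write(data)

    a_equals_b = same_content(paths["a"], paths["b"])
    a_equals_c = same_content(paths["a"], paths["c"])

print("a == b:", a_equals_b)
print("a == c:", a_equals_c)
```

Hashing is most useful when the digests are stored or sent elsewhere, for example to verify a download. For a one-off comparison of two local files, a direct content comparison avoids computing the hashes at all.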

Unveiling the fileinput Module

Our next example compares files line by line using the fileinput module, which simplifies reading lines from one or more files.

Example

import fileinput

file1 = "file1.txt"
file2 = "file2.txt"

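# Separate FileInput objects avoid sharing fileinput.input()'s global state
with fileinput.FileInput(files=file1) as f1, fileinput.FileInput(files=file2) as f2:
    for line1, line2 in zip(f1, f2):
        if line1 != line2:
            print(f"File1: {line1.strip()}\nFile2: {line2.strip()}\n")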

Output

For two sample files file1.txt and file2.txt that contain the single lines hello and hi, the output is

File1: hello
File2: hi

In this demonstration, the fileinput module reads both files line by line. The zip() function pairs up the corresponding lines of the two files, and whenever a pair differs, we print both lines to the console. The strip() method removes leading and trailing whitespace, including the newline, for display only; the comparison itself uses the unmodified lines. Keep in mind that zip() stops at the end of the shorter file, so extra lines at the end of the longer file are not reported.
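One way to report trailing lines as well is sketched below. Note that it swaps fileinput for plain open() and replaces zip() with itertools.zip_longest(), which keeps going until the longer file ends and fills the missing side with None. The function name differing_lines and the sample files are assumptions for this illustration.

```python
import os
import tempfile
from itertools import zip_longest


def differing_lines(path1, path2):
    """Return (line_number, line1, line2) for every line pair that differs.

    A side is None when that file has already run out of lines.
    """
    mismatches = []
    with open(path1, encoding="utf-8") as f1, open(path2, encoding="utf-8") as f2:
        for number, (line1, line2) in enumerate(zip_longest(f1, f2), start=1):
            if line1 != line2:
                mismatches.append((number, line1, line2))
    return mismatches


# Small demonstration with temporary files (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("hello\nworld\n")
    with open(second, "w", encoding="utf-8") as f:
        f.write("hi\nworld\nextra\n")

    result = differing_lines(first, second)

for number, line1, line2 in result:
    print(f"line {number}: {line1!r} != {line2!r}")
```

Reporting the line number makes the differences easier to locate. A positional comparison like this still treats every line after an inserted or deleted line as changed, which is where difflib gives better results.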

Utilizing the difflib.HtmlDiff Class

In the final example, we compare files with the difflib.HtmlDiff class, which produces HTML-formatted differences.

Example

import difflib

file1 = "file1.txt"
file2 = "file2.txt"

with open(file1, "r") as f1, open(file2, "r") as f2:
    diff = difflib.HtmlDiff().make_file(
        f1.readlines(), f2.readlines(), fromdesc=file1, todesc=file2
    )

# make_file() declares UTF-8 in the generated page, so write it as UTF-8
with open("diff.html", "w", encoding="utf-8") as html_file:
    html_file.write(diff)

In this example, we use the difflib module once more, this time through its HtmlDiff class, to create an HTML-formatted comparison. We open both files with open(), which gives us the file objects f1 and f2, and pass the line lists returned by readlines() to difflib.HtmlDiff().make_file(). This method returns a complete HTML page, as a string, that shows the two files side by side with the differences highlighted. The fromdesc and todesc parameters set the column headings. The resulting HTML is written to a file named "diff.html", which can be opened in any web browser.
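HtmlDiff can also produce only the comparison table rather than a full page, which is convenient when the result is embedded in an existing HTML document. The sketch below uses make_table() with context=True and numlines=1, so that only changed lines and one surrounding line of context are shown. The sample lines are assumptions for illustration.

```python
import difflib

# Illustrative sample data.
old_lines = ["alpha\n", "beta\n", "gamma\n"]
new_lines = ["alpha\n", "BETA\n", "gamma\n"]

html_table = difflib.HtmlDiff().make_table(
    old_lines,
    new_lines,
    fromdesc="old",
    todesc="new",
    context=True,  # show only changed lines plus some context
    numlines=1,    # number of context lines around each change
)

print(html_table[:200])
```

Unlike make_file(), make_table() returns only the table markup, so the surrounding page has to supply its own styling for the highlighted cells.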

In short, comparing two files is a valuable skill for any developer who works with file processing and data validation. In this article, we have looked at several ways to find the differences between files, each with its own strengths.

The filecmp and hashlib modules tell you whether two files are the same, while difflib.unified_diff(), fileinput, and difflib.HtmlDiff show you where they differ, as a text diff, a line-by-line report, or an HTML page.

Use these techniques to streamline data validation, version control, and file processing as you build robust and efficient file-handling applications.

Updated on: 03-Aug-2023
