How to Compare Two DataFrames with pandas compare()

If you work in data analysis or data science, you already know the importance of comparing DataFrames. The pandas library offers a compare() method (added in pandas 1.1.0) that compares two identically labeled DataFrames and returns only the cells that differ. This makes it easy to spot discrepancies between two versions of a dataset and decide what to do about them.

In this article, we will explore how to use pandas compare() to compare two DataFrames and dive into some of the customization options available. Whether you're an experienced data analyst or a beginner, this article will provide you with the knowledge you need to use compare() effectively.

Syntax

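The full signature of the compare() method is as follows:

df1.compare(df2, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))

Here df1 and df2 are the two DataFrames to compare. They must have identical row and column labels; otherwise pandas raises a ValueError. The remaining arguments are optional keyword arguments that control the layout of the result (result_names requires pandas 1.5 or later).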
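
Because compare() refuses DataFrames whose labels differ, it can help to trim both frames to their shared labels first. The following is a minimal sketch of that idea, assuming you only care about the rows and columns both frames have in common; the frames left and right are made up for illustration:

```python
import pandas as pd

# Hypothetical frames: `right` has one extra row and one extra column.
left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
right = pd.DataFrame({'A': [1, 2, 4, 8], 'B': [4, 5, 6, 9], 'C': [0, 0, 0, 0]})

try:
    left.compare(right)
except ValueError as exc:
    # pandas rejects frames whose index or columns do not match
    print("compare() failed:", exc)

# Keep only the row and column labels that appear in both frames.
common_rows = left.index.intersection(right.index)
common_cols = left.columns.intersection(right.columns)
aligned_left = left.loc[common_rows, common_cols]
aligned_right = right.loc[common_rows, common_cols]

# The trimmed frames are identically labeled, so compare() now works.
diff = aligned_left.compare(aligned_right)
print(diff)
```

Trimming silently ignores rows and columns that exist in only one frame, so this approach suits cases where those extras are expected and not themselves the thing you want to detect.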

Basic Example

Let's start with a simple example. Suppose we have two DataFrames with the same columns but different values:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Compare the DataFrames
comparison = df1.compare(df2)
print("\nComparison result:")
print(comparison)
DataFrame 1:
   A  B
0  1  4
1  2  5
2  3  6

DataFrame 2:
   A  B
0  1  4
1  2  5
2  4  7

Comparison result:
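     A           B
  self other  self other
2  3.0   4.0   6.0   7.0

The result shows only the differences between the two DataFrames. It has two column levels: under each original column name, the self column holds the value from df1 and the other column holds the value from df2. Only rows and columns that contain a difference appear (row 2 in this case). The values are displayed as floats because compare() masks unchanged cells with NaN internally, which converts integer columns to float.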
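
Since the result uses two column levels, ordinary pandas indexing can pull out individual pieces. The sketch below reuses the DataFrames from the example above; the variable names col_a, new_values, and changed_rows are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})
comparison = df1.compare(df2)

# Both sides (self and other) of a single original column
col_a = comparison['A']

# Only the values from df2, for every column that changed
new_values = comparison.xs('other', axis=1, level=1)

# Row labels that contain at least one difference
changed_rows = comparison.index.tolist()

print(col_a)
print(new_values)
print(changed_rows)
```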

Parameters and Options

The compare() function has several parameters that allow for more flexibility in comparing DataFrames.

Using keep_shape Parameter

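The keep_shape parameter controls whether the result keeps every row and column of the original DataFrames. By default it is False, so only rows and columns containing at least one difference are shown. With keep_shape=True, all rows and columns are kept and cells that are equal appear as NaN: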

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Keep original shape
comparison = df1.compare(df2, keep_shape=True)
print("With keep_shape=True:")
print(comparison)
With keep_shape=True:
     A           B    
  self other  self other
0  NaN   NaN   NaN   NaN
1  NaN   NaN   NaN   NaN
2  3.0   4.0   6.0   7.0
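
One way to see the effect of keep_shape is to compare the shapes of the two results. The sketch below also counts changed cells per column; that last step is a hypothetical follow-up and assumes df1 itself contains no missing values, so that NaN in the result always means "equal":

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

compact = df1.compare(df2)                 # only rows/columns that differ
full = df1.compare(df2, keep_shape=True)   # every row and column of df1

print(compact.shape)
print(full.shape)

# With keep_shape=True the row labels line up with the original frame.
print(full.index.equals(df1.index))

# Count changed cells per column by looking at the non-NaN "self" values.
changed_per_column = full.xs('self', axis=1, level=1).notna().sum()
print(changed_per_column)
```

Keeping the original shape is mainly useful when the result has to be aligned with the source frames, for example to build a mask or to join it back onto other columns.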

Using keep_equal Parameter

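The keep_equal parameter determines whether cells that are equal show their actual values instead of NaN. By default it is False. On its own, keep_equal=True still drops rows and columns that contain no differences: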

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Show equal values too
comparison = df1.compare(df2, keep_equal=True)
print("With keep_equal=True:")
print(comparison)
With keep_equal=True:
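     A           B
  self other  self other
2    3     4     6     7

Rows 0 and 1 are identical in both DataFrames, so they are still omitted; keep_equal=True only affects the cells of the rows and columns that remain. Nothing is masked with NaN here, so the values keep their integer dtype.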
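
The effect of keep_equal is easier to see when a row differs in only one column. The following sketch uses slightly different, purely illustrative data in which row 1 differs only in B and row 2 differs only in A:

```python
import pandas as pd

# Illustrative data: row 1 differs only in B, row 2 differs only in A.
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 9, 6]})

masked = df1.compare(df2)                   # equal cells are shown as NaN
shown = df1.compare(df2, keep_equal=True)   # equal cells keep their values

print(masked)
print(shown)
```

In masked, the A cells of row 1 and the B cells of row 2 are NaN; in shown, the same cells hold the matching values from both frames. Row 0 is absent from both results because it contains no differences.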

Combining Parameters

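You can combine parameters for more control over the output. With both keep_shape=True and keep_equal=True, the result shows every row and column, with the values from both DataFrames side by side: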

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Combine parameters
comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print("With both parameters:")
print(comparison)
With both parameters:
     A           B
  self other  self other
0    1     1     4     4
1    2     2     5     5
2    3     4     6     7
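
compare() has one more layout option that the examples above do not use: align_axis. A short sketch, reusing the same two DataFrames, of stacking the two sides as rows instead of placing them side by side:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# align_axis=0 puts self and other on alternating rows;
# the columns stay as the original column names.
stacked = df1.compare(df2, align_axis=0)
print(stacked)
```

The stacked layout can be easier to read when the DataFrames have many columns, since the result does not become twice as wide.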

Practical Use Cases

The compare() method is particularly useful for data validation, tracking changes between versions of a dataset, and reconciling data from different sources:

import pandas as pd

# Example: Comparing sales data before and after updates
original_data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [100, 200, 150],
    'Stock': [50, 30, 75]
})

updated_data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [105, 200, 140],
    'Stock': [45, 30, 80]
})

changes = original_data.compare(updated_data)
print("Changes detected:")
print(changes)
Changes detected:
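   Price         Stock
    self  other   self other
0  100.0  105.0   50.0  45.0
2  150.0  140.0   75.0  80.0

The Product column does not appear in the result because it is identical in both DataFrames, and row 1 is omitted for the same reason. The numbers are shown as floats because compare() masks unchanged cells with NaN before dropping the unchanged rows.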
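
The row labels 0 and 2 do not say which products changed. One possible refinement, sketched below, is to use Product as the index before comparing. This assumes both DataFrames list the same products in the same order, since compare() requires identical labels; the name price_delta is illustrative:

```python
import pandas as pd

original_data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [100, 200, 150],
    'Stock': [50, 30, 75]
})

updated_data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [105, 200, 140],
    'Stock': [45, 30, 80]
})

# Use Product as the row label so the result names the affected products.
changes = original_data.set_index('Product').compare(
    updated_data.set_index('Product')
)
print(changes)

# Price change per affected product (new value minus old value).
price_delta = changes[('Price', 'other')] - changes[('Price', 'self')]
print(price_delta)
```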

Comparison Table

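Parameter     Default            Description
align_axis    1                  Axis for laying out the comparison: 1 places self/other side by side as columns, 0 stacks them as rows
keep_shape    False              Keep every row and column of the original DataFrames; equal cells appear as NaN
keep_equal    False              Show the actual values of equal cells instead of NaN
result_names  ('self', 'other')  Labels for the two DataFrames in the output (pandas 1.5 and later)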
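
The result_names parameter replaces the default self and other labels. A brief sketch, assuming pandas 1.5 or later; the labels 'before' and 'after' are arbitrary choices for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Label the two sides 'before' and 'after' instead of 'self' and 'other'.
labeled = df1.compare(df2, result_names=('before', 'after'))
print(labeled)
```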

Conclusion

The pandas compare() method gives a compact, cell-level view of the differences between two identically labeled DataFrames. Use keep_shape, keep_equal, align_axis, and result_names to tailor the output to your needs, whether for data validation, change detection, or quality control.

Updated on: 2026-03-27T09:07:14+05:30
