How to Compare Two DataFrames with Pandas compare()?
If you work with data analysis or data science, then you already know the importance of comparing DataFrames. Fortunately, the Python library, pandas, offers a handy compare() method that allows you to compare two DataFrames and highlight their differences. This method is incredibly useful for identifying discrepancies between datasets and making informed decisions based on those differences.
In this article, we will explore how to use pandas compare() to compare two DataFrames and dive into some of the customization options available. Whether you're an experienced data analyst or a beginner, this article will provide you with the knowledge you need to use compare() effectively.
Syntax
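The signature of the compare() method is as follows:
df1.compare(df2, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))
Here df1 and df2 are the two DataFrames to compare. They must have identical row and column labels; otherwise compare() raises a ValueError. The keyword arguments are optional and control how the result is laid out. Note that compare() was added in pandas 1.1.0 and the result_names parameter in pandas 1.5.0.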
Basic Example
Let's start with a simple example. Suppose we have two DataFrames with the same columns but different values:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
# Compare the DataFrames
comparison = df1.compare(df2)
print("\nComparison result:")
print(comparison)
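DataFrame 1:
   A  B
0  1  4
1  2  5
2  3  6

DataFrame 2:
   A  B
0  1  4
1  2  5
2  4  7

Comparison result:
     A          B
  self other self other
2  3.0   4.0  6.0   7.0
The result shows only the differences between the two DataFrames. Under each original column, the self sub-column holds the value from df1 and the other sub-column holds the value from df2. Only rows and columns that contain at least one difference are kept (row 2 in this case). The values print as floats (3.0 rather than 3) because compare() masks equal cells with NaN before dropping the unchanged rows, which upcasts integer columns to float.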
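compare() expects both DataFrames to carry the same row and column labels. The following is a minimal sketch of what happens when they do not, together with one possible workaround (restricting both frames to their shared index labels). The frames left and right are hypothetical and are not part of the example above.

```python
import pandas as pd

# Hypothetical frames: `right` has an extra row (label 3) that `left` lacks.
left = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'A': [1, 2, 4, 5]}, index=[0, 1, 2, 3])

# Comparing frames whose labels differ raises ValueError.
try:
    left.compare(right)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print("Error raised:", mismatch_error is not None)

# One possible workaround: compare only the rows both frames share.
shared = left.index.intersection(right.index)
aligned_diff = left.loc[shared].compare(right.loc[shared])
print(aligned_diff)
```

Restricting to the shared labels silently ignores rows that exist in only one frame, so this approach suits cases where added or removed rows are reported separately.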
Parameters and Options
The compare() method has several parameters that allow for more flexibility in comparing DataFrames.
Using keep_shape Parameter
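The keep_shape parameter controls whether the result keeps the original shape. It defaults to False, which shows only the rows and columns that contain differences. With keep_shape=True, every row and column is kept and equal cells are filled with NaN: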
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})
# Keep original shape
comparison = df1.compare(df2, keep_shape=True)
print("With keep_shape=True:")
print(comparison)
With keep_shape=True:
     A          B
  self other self other
0  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN
2  3.0   4.0  6.0   7.0
Using keep_equal Parameter
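The keep_equal parameter determines what appears in cells whose values are equal. By default it is False and those cells show NaN; with keep_equal=True they show the actual values. Rows and columns with no differences at all are still dropped unless keep_shape=True is also set: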
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})
# Show equal values too
comparison = df1.compare(df2, keep_equal=True)
print("With keep_equal=True:")
print(comparison)
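With keep_equal=True:
     A          B
  self other self other
2    3     4    6     7
Rows 0 and 1 are still dropped because keep_shape defaults to False; keep_equal only affects cells inside the rows and columns that are kept. Every cell in row 2 differs, so the only visible change from the basic example is that the values stay integers instead of being upcast to float.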
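The effect of keep_equal is easier to see when a kept row mixes changed and unchanged cells. The sketch below uses two hypothetical frames, before and after, in which row 1 differs only in column B and row 2 differs only in column A.

```python
import pandas as pd

# Hypothetical data: row 1 changes in B only, row 2 changes in A only.
before = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
after = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 9, 6]})

# Default: unchanged cells inside the kept rows are masked with NaN.
masked = before.compare(after)
print(masked)

# keep_equal=True: the same rows are kept, but unchanged cells show their values.
unmasked = before.compare(after, keep_equal=True)
print(unmasked)
```

Both results contain rows 1 and 2 only. The first has NaN wherever the two frames agree, while the second fills those cells with the shared value.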
Combining Parameters
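You can use multiple parameters together for more control over the output. With both keep_shape=True and keep_equal=True, the result contains every row and shows the actual values instead of NaN: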
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})
# Combine parameters
comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print("With both parameters:")
print(comparison)
With both parameters:
     A          B
  self other self other
0    1     1    4     4
1    2     2    5     5
2    3     4    6     7
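One parameter not demonstrated above is align_axis, which controls whether the self and other values sit side by side as columns (the default, align_axis=1) or are stacked as rows (align_axis=0). A minimal sketch, reusing the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Stack the self/other values vertically instead of side by side.
stacked = df1.compare(df2, align_axis=0)
print(stacked)
```

In this layout the original column names stay as plain columns, and each differing row appears twice in the index, once labelled self and once labelled other. This can be easier to read when the DataFrames have many columns.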
Practical Use Cases
The compare() method is particularly useful for data validation, version control of datasets, and identifying changes between different data sources. The example below compares a product table before and after an update:
import pandas as pd
# Example: Comparing sales data before and after updates
original_data = pd.DataFrame({
'Product': ['A', 'B', 'C'],
'Price': [100, 200, 150],
'Stock': [50, 30, 75]
})
updated_data = pd.DataFrame({
'Product': ['A', 'B', 'C'],
'Price': [105, 200, 140],
'Stock': [45, 30, 80]
})
changes = original_data.compare(updated_data)
print("Changes detected:")
print(changes)
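Changes detected:
   Price        Stock
    self  other  self other
0  100.0  105.0  50.0  45.0
2  150.0  140.0  75.0  80.0
The Product column is absent from the result because it is identical in both DataFrames, and row 1 is dropped because nothing in it changed.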
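The comparison result can be processed further like any other DataFrame. As a sketch of one possible follow-up step (not something compare() does on its own), the code below indexes both tables by Product so the changed rows are labelled by product name, then subtracts the self values from the other values to get the size of each change. The variable names changes and deltas are illustrative.

```python
import pandas as pd

original_data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [100, 200, 150],
    'Stock': [50, 30, 75]
})
updated_data = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [105, 200, 140],
    'Stock': [45, 30, 80]
})

# Index by Product so the result rows are labelled 'A', 'C' instead of 0, 2.
changes = original_data.set_index('Product').compare(
    updated_data.set_index('Product')
)

# Select the 'other' and 'self' sub-columns and subtract to get per-column deltas.
deltas = changes.xs('other', axis=1, level=1) - changes.xs('self', axis=1, level=1)
print(deltas)
```

A positive delta means the value increased in updated_data, and a negative delta means it decreased. This subtraction only makes sense for numeric columns.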
Comparison Table
| Parameter | Default | Description |
|---|---|---|
| keep_shape | False | Keep original DataFrame shape in output |
| keep_equal | False | Show equal values in comparison result |
| result_names | ('self', 'other') | Names for the compared DataFrames in output |
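The result_names parameter listed in the table replaces the default self and other labels with names of your choice. A minimal sketch, assuming pandas 1.5.0 or later (the version that introduced this parameter); the labels before and after are arbitrary:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Replace the default 'self'/'other' labels with more descriptive ones.
labelled = df1.compare(df2, result_names=('before', 'after'))
print(labelled)
```

Descriptive labels help when the result is shared with readers who do not know which DataFrame the method was called on.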
Conclusion
The pandas compare() method is a powerful tool for identifying differences between DataFrames. Use it with appropriate parameters to customize the output based on your specific needs, whether for data validation, change detection, or quality control processes.
