How to Compare Two DataFrames with pandas compare()?


If you work in data analysis or data science, you already know how often two DataFrames need to be compared. Fortunately, the Python library pandas offers a handy compare() method that compares two DataFrames and shows where they differ. This method is useful for identifying discrepancies between two versions of a dataset and making informed decisions based on those differences.

In this article, we will explore how to use pandas compare() to compare two DataFrames and look at some of the customization options available. Whether you're an experienced data analyst or a beginner, this article will give you the knowledge you need to use compare() effectively and confidently.

Basic Syntax

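The basic syntax for the compare() method is as follows:

df1.compare(df2, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))

where df1 and df2 are the two dataframes that we want to compare. Both dataframes must have identical row and column labels; otherwise compare() raises a ValueError. The remaining arguments are optional keyword parameters that control how the result is laid out. The compare() method was added in pandas 1.1.0, and the result_names parameter in pandas 1.5.0.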
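Because compare() only accepts dataframes with matching labels, it can help to align the two frames before calling it. The following is a minimal sketch of one way to do that; the names left, right, shared, and aligned_diff are illustrative and not part of the pandas API.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
right = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7], 'C': [8, 9, 10]})

# compare() refuses frames whose labels differ (here: the extra column C).
try:
    left.compare(right)
    labels_error = None
except ValueError as exc:
    labels_error = str(exc)

# Restrict both frames to the columns they share before comparing.
shared = left.columns.intersection(right.columns)
aligned_diff = left[shared].compare(right[shared])

print(labels_error)
print(aligned_diff)
```

Selecting the shared columns is only one option; depending on the data, reindexing or merging on a key column may be more appropriate.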

Example

Let's start with a simple example. Suppose we have two dataframes, df1 and df2, as shown below:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

The two dataframes have the same index and columns, but their values differ in the last row. We can use the compare() function to compare these two dataframes as follows:

comparison = df1.compare(df2)
print(comparison)

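Output

This will output a result similar to the following (exact spacing may vary between pandas versions):

     A          B
  self other self other
2  3.0   4.0  6.0   7.0

The result shows only the cells that differ between the two dataframes. Under each column name, the self sub-column holds the value from df1 and the other sub-column holds the value from df2. The only row with differences is the one with index 2: column A is 3 in df1 and 4 in df2, while column B is 6 in df1 and 7 in df2. Rows 0 and 1 are identical in both dataframes, so they are omitted. The values are printed as floats because compare() masks equal cells with NaN, which converts integer columns to float.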
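If the size of each change matters, it can be computed from the comparison result. The sketch below assumes the default sub-column labels self and other, and the variable names old_values, new_values, and delta are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

comparison = df1.compare(df2)

# Split the two-level columns into the df1 ("self") and df2 ("other") halves.
old_values = comparison.xs('self', axis=1, level=1)
new_values = comparison.xs('other', axis=1, level=1)

# Element-wise change from df1 to df2 for the cells that differ.
delta = new_values - old_values
print(delta)
```

Here delta has one row (index 2) and shows that both A and B increased by 1 from df1 to df2.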

Example

Here is one more example of comparing two dataframes using the pandas compare() function, this time with a comment on each step:

import pandas as pd

# Create two dataframes to compare
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# Compare the two dataframes using the compare() function
comparison = df1.compare(df2)

# Print the comparison
print(comparison)

Output

This will output a result similar to the following (exact spacing may vary between pandas versions):

     A          B
  self other self other
2  3.0   4.0  6.0   7.0

The compare() function generates a new dataframe that displays the differences between df1 and df2. Its columns form a two-level index: the top level names each column that contains at least one difference, and the second level splits it into self (the value from df1, the dataframe on which compare() was called) and other (the value from df2, the dataframe passed as the argument). The row index lists only the rows in which a difference was found. Cells that are equal in both dataframes appear as NaN, and rows or columns with no differences at all are dropped.
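The returned dataframe can also be inspected programmatically rather than printed. The following sketch shows one way to list the changed rows and columns and to check whether two dataframes are identical; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

comparison = df1.compare(df2)

# Row labels that contain at least one difference.
changed_rows = comparison.index.tolist()

# The top column level names the columns that contain a difference.
changed_columns = comparison.columns.get_level_values(0).unique().tolist()

# Comparing a frame with an identical copy yields an empty result.
no_changes = df1.compare(df1.copy()).empty

print(changed_rows, changed_columns, no_changes)
```

Checking the empty attribute is a convenient test for "no differences", as long as both dataframes share the same labels.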

Parameters and Options

The compare() function has several parameters and options that allow for more flexibility in comparing dataframes. Let's take a look at some of these.

‘keep_shape’

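The keep_shape parameter controls whether the result keeps the shape of the original dataframes. By default, this parameter is set to False, which means that only the rows and columns containing at least one difference appear in the result. If set to True, every row and column is kept, and cells that are equal in both dataframes are shown as NaN. Note that keep_shape does not relax the requirement that both dataframes have identical labels: an extra column in either dataframe causes compare() to raise a ValueError, so the shared columns must be selected first.

Example

Here is an example:

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7], 'C': [8, 9, 10]})

# df2 has an extra column C, so restrict it to the columns of df1 first
comparison = df1.compare(df2[df1.columns], keep_shape=True)
print(comparison)

Output

This will output a result similar to the following:

     A          B
  self other self other
0  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN
2  3.0   4.0  6.0   7.0

Notice that rows 0 and 1 are kept even though they contain no differences. Column C is absent only because we selected the shared columns before calling compare().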
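To see what keep_shape changes, it helps to compare the shape of the result with and without it. The sketch below uses a second dataframe in which only column A differs; the names compact and full are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 6]})  # only A differs, in row 2

# Default (keep_shape=False): only the differing row and column survive.
compact = df1.compare(df2)

# keep_shape=True: every row and column is kept, equal cells become NaN.
full = df1.compare(df2, keep_shape=True)

print(compact.shape)
print(full.shape)
```

The compact result has one row and two sub-columns (self and other for A), while the full result has three rows and four sub-columns, with column B entirely NaN.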

‘keep_equal’

The keep_equal parameter determines how cells that are equal in both dataframes are displayed in the result. By default, this parameter is set to False, which means that equal cells are shown as NaN. If keep_equal is set to True, equal cells keep their original values instead. Equality is decided by value rather than by type, so 1 and 1.0 are treated as equal.

Example

Here is an example:

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1.0, 5, 4], 'B': [4, 5, 7]})

comparison = df1.compare(df2, keep_equal=True)
print(comparison)

Output

This will output a result similar to the following:

     A          B
  self other self other
1    2   5.0    5     5
2    3   4.0    6     7

Row 0 does not appear because it contains no differences: the 1 in df1 equals the 1.0 in df2 even though one is stored as an integer and the other as a float. In row 1, only column A differs (2 in df1, 5 in df2), but because keep_equal=True the equal values in column B (5 and 5) are displayed instead of NaN. The other sub-column of A is printed with a decimal point because column A of df2 has a float dtype.
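The following sketch contrasts the default behaviour with keep_equal=True, and also checks that equal values of different numeric types produce no differences. The variable names ints, floats, masked, and kept are illustrative.

```python
import pandas as pd

ints = pd.DataFrame({'A': [1, 2, 3]})
floats = pd.DataFrame({'A': [1.0, 2.0, 3.0]})

# Equal values stored as int and float produce no differences.
same_values = ints.compare(floats).empty

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 5, 4], 'B': [4, 5, 7]})

masked = df1.compare(df2)                 # equal cells become NaN
kept = df1.compare(df2, keep_equal=True)  # equal cells keep their values

print(same_values)
print(masked)
print(kept)
```

In row 1, column B is equal in both dataframes, so masked shows NaN there while kept shows the value 5 on both sides.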

‘keep_shape’ and ‘keep_equal’

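The keep_shape and keep_equal parameters can be used together to keep every row and column and to show every value, whether or not it differs. For example:

Example

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1.0, 2, 4], 'B': [4, 5, 7], 'C': [8, 9, 10]})

# Select the shared columns first, because compare() requires identical labels
comparison = df1.compare(df2[df1.columns], keep_shape=True, keep_equal=True)
print(comparison)

Output

This will output a result similar to the following:

     A          B
  self other self other
0    1   1.0    4     4
1    2   2.0    5     5
2    3   4.0    6     7

Notice that the result now has the full shape of the input and places the values of df1 and df2 side by side for every cell. Only row 2 actually differs; rows 0 and 1 are shown because of keep_shape=True, with their values preserved by keep_equal=True. Column C is left out because we selected the shared columns before calling compare().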
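Two further parameters of compare() affect the layout of the result: align_axis and result_names. The sketch below shows both; result_names assumes pandas 1.5 or later, and the labels old and new are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 4], 'B': [4, 5, 7]})

# align_axis=0 stacks the two versions as rows instead of side-by-side columns.
stacked = df1.compare(df2, align_axis=0)

# result_names replaces the default 'self' and 'other' labels.
labelled = df1.compare(df2, result_names=('old', 'new'))

print(stacked)
print(labelled)
```

With align_axis=0 the row index becomes two-level (original row label, then self or other), which can be easier to read when the dataframes have many columns.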

Conclusion

In summary, the pandas library offers a handy compare() method that lets data analysts and scientists quickly pinpoint the differences between two DataFrames. Its parameters make it possible to tailor the layout of the result to specific requirements. pandas also offers a wide range of other methods and tools for different data analysis and manipulation needs, so building skill with pandas is a real advantage for data professionals who need to handle and analyze large datasets efficiently.

Updated on: 20-Jul-2023
