Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to compare two DataFrames in Python Pandas with missing values
When working with DataFrames containing missing values, comparing data becomes challenging because NumPy's NaN values don't behave like regular values. Understanding how to properly compare DataFrames with missing data is essential for data analysis tasks.
Understanding NaN Behavior
NumPy NaN values have unique mathematical properties that differ from Python's None object ?
import pandas as pd
import numpy as np
# Python None Object compared against self
print(f"Python None == None: {None == None}")
# Numpy nan compared against self
print(f"np.nan == np.nan: {np.nan == np.nan}")
# Is nan greater than numbers?
print(f"np.nan > 10: {np.nan > 10}")
Python None == None: True np.nan == np.nan: False np.nan > 10: False
Creating Sample Data with Missing Values
Let's create a DataFrame with tennis players and their Grand Slam titles, including missing values ?
import pandas as pd
import numpy as np
# Create dataframe with tennis players and their grandslam titles
df = pd.DataFrame(data={
"players": ['Federer', 'Nadal', 'Djokovic', 'Murray', 'Medvedev', 'Zverev'],
"titles": [20, 19, 17, 3, np.nan, np.nan]
})
# Set the index and sort
df.index = df['players']
df.sort_index(inplace=True, ascending=True)
print(df)
players titles
players
Djokovic Djokovic 17.0
Federer Federer 20.0
Medvedev Medvedev NaN
Murray Murray 3.0
Nadal Nadal 19.0
Zverev Zverev NaN
Problems with the Equals Operator (==)
Using the standard equals operator with DataFrames containing NaN values produces unexpected results ?
# Compare DataFrame against itself
df_compare = df == df
print("DataFrame comparison result:")
print(df_compare)
# Check if all values are True in each column
print(f"\nAll values True in each column:")
print(df_compare.all())
DataFrame comparison result:
players titles
players
Djokovic True True
Federer True True
Medvedev True False
Murray True True
Nadal True True
Zverev True False
All values True in each column:
players True
titles False
dtype: bool
Notice that the titles column shows False for rows with NaN values, even though we're comparing identical DataFrames.
Using the .equals() Method
The correct way to compare DataFrames with missing values is using the .equals() method, which treats NaN values in the same location as equal ?
# Compare DataFrames using .equals() method
result = df.equals(df)
print(f"df.equals(df): {result}")
# Create a copy and compare
df_copy = df.copy()
result_copy = df.equals(df_copy)
print(f"df.equals(df_copy): {result_copy}")
df.equals(df): True df.equals(df_copy): True
Using assert_frame_equal for Testing
For unit testing purposes, you can use assert_frame_equal() which raises an error if DataFrames are not equal ?
from pandas.testing import assert_frame_equal
try:
assert_frame_equal(df, df.copy())
print("DataFrames are equal - no exception raised")
except AssertionError as e:
print(f"DataFrames are not equal: {e}")
DataFrames are equal - no exception raised
Comparison Methods Summary
| Method | Handles NaN | Return Type | Best For |
|---|---|---|---|
== operator |
No | DataFrame of boolean | Element-wise comparison |
.equals() |
Yes | Single boolean | Overall DataFrame equality |
assert_frame_equal() |
Yes | None or Exception | Unit testing |
Conclusion
When comparing DataFrames with missing values, use .equals() for overall equality checks and assert_frame_equal() for testing. The standard == operator fails with NaN values because np.nan != np.nan in NumPy.
