How to compare two DataFrames in Python Pandas with missing values

When working with DataFrames containing missing values, comparing data becomes challenging because NumPy's NaN values don't behave like regular values. Understanding how to properly compare DataFrames with missing data is essential for data analysis tasks.

Understanding NaN Behavior

NumPy NaN values have unique mathematical properties that differ from Python's None object ?

import pandas as pd
import numpy as np

# Python None Object compared against self
print(f"Python None == None: {None == None}")

# Numpy nan compared against self
print(f"np.nan == np.nan: {np.nan == np.nan}")

# Is nan greater than numbers?
print(f"np.nan > 10: {np.nan > 10}")
Python None == None: True
np.nan == np.nan: False
np.nan > 10: False

Creating Sample Data with Missing Values

Let's create a DataFrame with tennis players and their Grand Slam titles, including missing values ?

import pandas as pd
import numpy as np

# Create dataframe with tennis players and their grandslam titles
df = pd.DataFrame(data={
    "players": ['Federer', 'Nadal', 'Djokovic', 'Murray', 'Medvedev', 'Zverev'],
    "titles": [20, 19, 17, 3, np.nan, np.nan]
})

# Set the index and sort
df.index = df['players']
df.sort_index(inplace=True, ascending=True)

print(df)
          players  titles
players                 
Djokovic  Djokovic    17.0
Federer   Federer     20.0
Medvedev  Medvedev     NaN
Murray    Murray       3.0
Nadal     Nadal       19.0
Zverev    Zverev       NaN

Problems with the Equals Operator (==)

Using the standard equals operator with DataFrames containing NaN values produces unexpected results ?

# Compare DataFrame against itself
df_compare = df == df
print("DataFrame comparison result:")
print(df_compare)

# Check if all values are True in each column
print(f"\nAll values True in each column:")
print(df_compare.all())
DataFrame comparison result:
          players  titles
players                 
Djokovic     True    True
Federer      True    True
Medvedev     True   False
Murray       True    True
Nadal        True    True
Zverev       True   False

All values True in each column:
players     True
titles     False
dtype: bool

Notice that the titles column shows False for rows with NaN values, even though we're comparing identical DataFrames.

Using the .equals() Method

The correct way to compare DataFrames with missing values is using the .equals() method, which treats NaN values in the same location as equal ?

# Compare DataFrames using .equals() method
result = df.equals(df)
print(f"df.equals(df): {result}")

# Create a copy and compare
df_copy = df.copy()
result_copy = df.equals(df_copy)
print(f"df.equals(df_copy): {result_copy}")
df.equals(df): True
df.equals(df_copy): True

Using assert_frame_equal for Testing

For unit testing purposes, you can use assert_frame_equal() which raises an error if DataFrames are not equal ?

from pandas.testing import assert_frame_equal

try:
    assert_frame_equal(df, df.copy())
    print("DataFrames are equal - no exception raised")
except AssertionError as e:
    print(f"DataFrames are not equal: {e}")
DataFrames are equal - no exception raised

Comparison Methods Summary

Method Handles NaN Return Type Best For
== operator No DataFrame of boolean Element-wise comparison
.equals() Yes Single boolean Overall DataFrame equality
assert_frame_equal() Yes None or Exception Unit testing

Conclusion

When comparing DataFrames with missing values, use .equals() for overall equality checks and assert_frame_equal() for testing. The standard == operator fails with NaN values because np.nan != np.nan in NumPy.

Updated on: 2026-03-25T11:53:51+05:30

2K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements