How to compare two DataFrames in Python Pandas with missing values

PythonServer Side ProgrammingProgramming


Pandas uses the NumPy NaN (np.nan) object to represent a missing value. This Numpy NaN value has some interesting mathematical properties. For example, it is not equal to itself. However, Python None object evaluates as True when compared to itself.

How to do it..

Let us see some examples to understand how np.nan behaves.

import pandas as pd
import numpy as np

# Python None Object compared against self.
print(f"Output \n *** {None == None} ")


*** True

# Numpy nan compared against self.
print(f"Output \n *** {np.nan == np.nan} ")


*** False

# Is nan > 10 or 1000 ?
print(f"Output \n *** {np.nan > 10} ")


*** False

Traditionally, Series and DataFrames use the equals operator, ==, to make comparisons. The result of the comparsions is an object. Let us first see how to use the equals operator.

# create a dataframe with tennis players and their grandslam titles.
df = pd.DataFrame(data={"players": ['Federer', 'Nadal', 'Djokovic', 'Murray','Medvedev','Zverev'],
"titles": [20, 19, 17, 3,np.nan,np.nan]})
# set the index
df.index = df['players']

# sort the index in ascending
df.sort_index(inplace=True, ascending=True)

# check if the index is set

# see records
print(f"Output \n{df}")


         players    titles
Djokovic Djokovic     17.0
Federer  Federer      20.0
Medvedev Medvedev     NaN
Murray   Murray       3.0
Nadal    Nadal        19.0
Zverev Zverev      NaN

1. To better understand, we will first compare all the players to a scalar value "Federer" and see the results.

print(f'Output \n {df == "Federer"}')


          players    titles
Djokovic  False       False
Federer   True        False
Medvedev   False      False
Murray     False      False
Nadal      False      False
Zverev     False      False

C:\Users\sasan\anaconda3\lib\site-packages\pandas\core\ops\ FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)

2. This works as expected but becomes problematic whenever you try to compare DataFrames with missing values. To observer this let us compare the df against self.

df_compare = df == df
print(f'Output \n {df_compare}')


players titles
Djokovic True True
Federer True True
Medvedev True False
Murray True True
Nadal True True
Zverev True False

3. At first glance, all the values might appear to be correct, as you would expect. However, use the .all method to see if each column contains only True values (as it should be as we are comparing two similar objects right?) yields an unexpected result.

print(f'Output \n {df_compare.all()}')


players True
titles False
dtype: bool

4.As mentioned in earlier notes, this happens because missing values do not compare equally with one another. See, we clearly know that medvedev and Zverev have no titles (i.e. NaN) so if we add the number of missing values in each column we should get the value 2 for titles and 0 for players. Let us see what happens.

print(f'Output \n {(df_compare == np.nan).sum()}')


players 0
titles 0
dtype: int64

5. ABove result is unexpected as nan behaves very differently.

6. The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method.

This method treats NaNs that are in the same location as equal.

AN important note the .eq method is the equivalent of == not .equals.

print(f'Output \n {df_compare.equals(df_compare)}')



7. There is also another way of doing if you are trying to compare two DataFrames as part of your unit testing. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal.

from pandas.testing import assert_frame_equal
print(f'Output \n {assert_frame_equal(df_compare, df_compare) is None}')


Published on 09-Nov-2020 15:06:07