- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
Physics
Chemistry
Biology
Mathematics
English
Economics
Psychology
Social Studies
Fashion Studies
Legal Studies
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to compare two DataFrames in Python Pandas with missing values
Introduction
Pandas uses the NumPy NaN (np.nan) object to represent a missing value. This Numpy NaN value has some interesting mathematical properties. For example, it is not equal to itself. However, Python None object evaluates as True when compared to itself.
How to do it..
Let us see some examples to understand how np.nan behaves.
import pandas as pd import numpy as np # Python None Object compared against self. print(f"Output \n *** {None == None} ")
Output
*** True
# Numpy nan compared against self. print(f"Output \n *** {np.nan == np.nan} ")
Output
*** False
# Is nan > 10 or 1000 ? print(f"Output \n *** {np.nan > 10} ")
Output
*** False
Traditionally, Series and DataFrames use the equals operator, ==, to make comparisons. The result of the comparsions is an object. Let us first see how to use the equals operator.
# create a dataframe with tennis players and their grandslam titles. df = pd.DataFrame(data={"players": ['Federer', 'Nadal', 'Djokovic', 'Murray','Medvedev','Zverev'], "titles": [20, 19, 17, 3,np.nan,np.nan]}) # set the index df.index = df['players'] # sort the index in ascending df.sort_index(inplace=True, ascending=True) # check if the index is set df.index.is_monotonic_increasing # see records print(f"Output \n{df}")
Output
players titles players Djokovic Djokovic 17.0 Federer Federer 20.0 Medvedev Medvedev NaN Murray Murray 3.0 Nadal Nadal 19.0 Zverev Zverev NaN
1. To better understand, we will first compare all the players to a scalar value "Federer" and see the results.
print(f'Output \n {df == "Federer"}')
Output
players titles players Djokovic False False Federer True False Medvedev False False Murray False False Nadal False False Zverev False False
C:\Users\sasan\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison res_values = method(rvalues)
2. This works as expected but becomes problematic whenever you try to compare DataFrames with missing values. To observer this let us compare the df against self.
df_compare = df == df print(f'Output \n {df_compare}')
Output
players titles players Djokovic True True Federer True True Medvedev True False Murray True True Nadal True True Zverev True False
3. At first glance, all the values might appear to be correct, as you would expect. However, use the .all method to see if each column contains only True values (as it should be as we are comparing two similar objects right?) yields an unexpected result.
print(f'Output \n {df_compare.all()}')
Output
players True titles False dtype: bool
4.As mentioned in earlier notes, this happens because missing values do not compare equally with one another. See, we clearly know that medvedev and Zverev have no titles (i.e. NaN) so if we add the number of missing values in each column we should get the value 2 for titles and 0 for players. Let us see what happens.
print(f'Output \n {(df_compare == np.nan).sum()}')
Output
players 0 titles 0 dtype: int64
5. ABove result is unexpected as nan behaves very differently.
6. The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method.
This method treats NaNs that are in the same location as equal.
AN important note the .eq method is the equivalent of == not .equals.
print(f'Output \n {df_compare.equals(df_compare)}')
Output
True
7. There is also another way of doing if you are trying to compare two DataFrames as part of your unit testing. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal.
from pandas.testing import assert_frame_equal print(f'Output \n {assert_frame_equal(df_compare, df_compare) is None}')
Output
True