Pandas uses the NumPy NaN (np.nan) object to represent a missing value. This NumPy NaN value has some interesting mathematical properties. For example, it is not equal to itself. In contrast, Python's None object compares equal to itself.

Let us see some examples to understand how np.nan behaves.

import pandas as pd
import numpy as np

# Python None object compared against itself.
print(f"Output \n *** {None == None} ")

*** True

# NumPy nan compared against itself.
print(f"Output \n *** {np.nan == np.nan} ")

*** False

# Is nan > 10 or 1000?
print(f"Output \n *** {np.nan > 10} ")

*** False
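Since equality cannot detect NaN, a dedicated predicate is needed. The following is a minimal sketch of the usual options; the variable name `value` is only illustrative.

```python
import math

import numpy as np
import pandas as pd

value = np.nan

# Equality can never detect NaN, because NaN is not equal to itself.
print(value == np.nan)

# Dedicated predicates test for NaN directly.
print(math.isnan(value))
print(np.isnan(value))
print(pd.isna(value))

# pd.isna also treats None as missing.
print(pd.isna(None))
```

Of these, pd.isna is usually the most convenient with pandas objects because it also accepts Series and DataFrames.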

Traditionally, Series and DataFrames use the equals operator, ==, to make comparisons. The result of the comparison is an object of the same shape holding boolean values. Let us first see how to use the equals operator.

# Create a DataFrame of tennis players and their Grand Slam titles.
df = pd.DataFrame(data={"players": ['Federer', 'Nadal', 'Djokovic', 'Murray', 'Medvedev', 'Zverev'],
                        "titles": [20, 19, 17, 3, np.nan, np.nan]})

# Set the index.
df.index = df['players']

# Sort the index in ascending order.
df.sort_index(inplace=True, ascending=True)

# Check that the index is sorted.
df.index.is_monotonic_increasing

# See the records.
print(f"Output \n{df}")

           players  titles
players
Djokovic  Djokovic    17.0
Federer    Federer    20.0
Medvedev  Medvedev     NaN
Murray      Murray     3.0
Nadal        Nadal    19.0
Zverev      Zverev     NaN

1. To better understand this, we will first compare every value in the DataFrame to the scalar value "Federer" and look at the results.

print(f'Output \n {df == "Federer"}')

          players  titles
players
Djokovic    False   False
Federer      True   False
Medvedev    False   False
Murray      False   False
Nadal       False   False
Zverev      False   False

C:\Users\sasan\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)

2. This works as expected but becomes problematic whenever you try to compare DataFrames with missing values. To observe this, let us compare df against itself.

df_compare = df == df
print(f'Output \n {df_compare}')

          players  titles
players
Djokovic     True    True
Federer      True    True
Medvedev     True   False
Murray       True    True
Nadal        True    True
Zverev       True   False

3. At first glance, all the values appear to be correct. However, using the .all method to check whether each column contains only True values (as it should, since we are comparing an object with itself) yields an unexpected result.

print(f'Output \n {df_compare.all()}')

players     True
titles     False
dtype: bool

4. As noted earlier, this happens because missing values do not compare equal to one another. We know that Medvedev and Zverev have no titles recorded (i.e. NaN), so if we count the missing values in each column by comparing against np.nan, we would expect 2 for titles and 0 for players. Let us see what happens.

print(f'Output \n {(df == np.nan).sum()}')

players    0
titles     0
dtype: int64

5. The result above is unexpected: because NaN never compares equal to anything, including itself, the comparison finds no missing values at all.
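To get the counts we expected, the missing values have to be detected with isna rather than with ==. The sketch below uses a smaller, hypothetical copy of the players data and also shows one possible way to build an element-wise comparison that treats NaNs in the same position as equal; the names `other`, `missing_counts` and `same` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"players": ["Federer", "Nadal", "Medvedev", "Zverev"],
     "titles": [20, 19, np.nan, np.nan]}
)
other = df.copy()

# Count missing values per column with isna(), not with == np.nan.
missing_counts = df.isna().sum()
print(missing_counts)

# Element-wise comparison that treats NaNs in the same position as equal:
# a cell matches if the values are equal OR both cells are missing.
same = (df == other) | (df.isna() & other.isna())
print(same.all())
```

Here missing_counts reports 2 for titles and 0 for players, and every column of same is entirely True.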

6. The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method.

This method treats NaNs that are in the same location as equal.

An important note: the .eq method is the equivalent of ==, not of .equals.

print(f'Output \n {df.equals(df)}')

True
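The difference between .eq and .equals is easier to see on two separately built DataFrames. The following is a small sketch with made-up frames (`left`, `right`, `ints`, `floats` are illustrative names); it also shows that .equals expects the column dtypes to match.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"titles": [20.0, np.nan]}, index=["Federer", "Zverev"])
right = pd.DataFrame({"titles": [20.0, np.nan]}, index=["Federer", "Zverev"])

# .eq is element-wise, like ==, so the NaN position compares as False.
elementwise = left.eq(right)
print(elementwise)

# .equals returns a single bool and treats NaNs in the same location as equal.
print(left.equals(right))

# .equals also requires matching dtypes: an int column does not equal a float one.
ints = pd.DataFrame({"titles": [20, 19]})
floats = pd.DataFrame({"titles": [20.0, 19.0]})
print(ints.equals(floats))
```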

7. There is another option if you are comparing two DataFrames as part of unit testing. The assert_frame_equal function raises an AssertionError if the two DataFrames are not equal and returns None if they are.

from pandas.testing import assert_frame_equal

print(f'Output \n {assert_frame_equal(df, df) is None}')

True
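In a test you will usually care about the failing case as well. The sketch below, using hypothetical `expected` and `actual` frames, shows how a mismatch surfaces as an AssertionError and how the check_dtype parameter can relax the dtype check.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"titles": [20.0, np.nan]}, index=["Federer", "Zverev"])
actual = pd.DataFrame({"titles": [20.0, 0.0]}, index=["Federer", "Zverev"])

# NaNs in matching positions count as equal, so this passes silently.
assert_frame_equal(expected, expected.copy())

# A mismatch raises AssertionError with a message describing the difference.
try:
    assert_frame_equal(expected, actual)
    frames_match = True
except AssertionError as err:
    frames_match = False
    print(f"Frames differ: {err}")

# check_dtype=False relaxes the dtype check, e.g. int64 vs float64 columns.
assert_frame_equal(
    pd.DataFrame({"titles": [20, 19]}),
    pd.DataFrame({"titles": [20.0, 19.0]}),
    check_dtype=False,
)
```

In a real test suite you would normally let the AssertionError propagate so that the test runner reports the failure; it is caught here only to show the message.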
