Drop rows from Pandas dataframe with missing values or NaN in columns


A dataset can contain a wide variety of values: strings, integers, decimals, Booleans or even data structures. Such datasets are extremely valuable and serve many purposes. With the help of a dataset, we can train models, interpret results, form hypotheses and build applications.

However, a dataset sometimes has entries for which no value was recorded. pandas represents such missing values as “NaN” (not a number). In this article, we will be dealing with these “NaN” or missing values.

Our objective is to drop those rows that contain any “NaN” value from a pandas data frame. We will create a data frame from a dataset and use the functions of the pandas library to drop rows. Let’s begin.

Creating a Pandas Data Frame with NaN Values

A pandas data frame is a 2D tabular arrangement of data that is widely used for data analysis, interpretation and manipulation. It organises data into labelled rows and columns. Pandas offers numerous functions that allow the sorting, merging, filtering and deletion of data. Let’s build a pandas data frame.

Example

In the following example, we pass a dictionary dataset in which each key is a column label and the associated values are given as a list.

We then create a pandas data frame with “pd.DataFrame” and pass a list of row labels through the “index” argument. Some values in the dataset are set to “NaN” using the numpy library.

import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}

dataframe = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("The original data frame is: -")
print(dataframe)

Output

The original data frame is: -
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN
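
Before dropping anything, it can help to see where the missing values are. The following is a minimal sketch that counts them with “isna()”. The variable names “missing_per_column” and “rows_with_missing” are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Same illustrative dataset as in the example above.
dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}
dataframe = pd.DataFrame(dataset, index=[1, 2, 3, 4])

# isna() marks every missing cell with True; summing the marks counts them per column.
missing_per_column = dataframe.isna().sum()

# any(axis=1) is True for each row that has at least one missing cell.
rows_with_missing = int(dataframe.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with at least one NaN:", rows_with_missing)
```

For this dataset, “Roll number” has one missing value, “Marks” has two, and three of the four rows are affected.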

Using dropna() Function to Drop Rows with “NaN” Values

We can use the “dropna()” function to drop rows or columns from the data frame. By default, it drops every row that contains at least one “NaN” value.

  • After creating the data frame, we used the “dropna()” function to drop all the rows containing any “NaN” value.

  • We created a new data frame “drop_dataframe” which contains the modified values and printed it.

  • Here, the 2nd, 3rd and 4th rows are dropped because each of them contains at least one “NaN” value.

Example

import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}

dataframe = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("The original data frame is: -")
print(dataframe)

drop_dataframe = dataframe.dropna()
print("The data frame after dropping the rows: -")
print(drop_dataframe)

Output

The original data frame is: -
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN
The data frame after dropping the rows: -
  Student name  Roll number  Major Subject  Marks
1         Ajay         23.0          Maths   57.0
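
The same function can drop columns instead of rows. As a small sketch, passing “axis="columns"” to “dropna()” removes every column that contains a “NaN” value. The name “column_dropped” is an illustrative choice.

```python
import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}
dataframe = pd.DataFrame(dataset, index=[1, 2, 3, 4])

# axis="columns" (or axis=1) drops columns with missing values instead of rows.
column_dropped = dataframe.dropna(axis="columns")

print(column_dropped.columns.tolist())  # ['Student name', 'Major Subject']
```

All four rows are kept here, but “Roll number” and “Marks” are removed.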

If we don’t want a new data frame, we can modify the existing one instead. This can be achieved by passing the “inplace=True” argument to “dropna()”.

dataframe.dropna(inplace=True)
print("The data frame after dropping the rows: -")
print(dataframe)
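
One detail worth keeping in mind: with “inplace=True”, “dropna()” returns “None”, so its result should not be assigned back to a variable. The sketch below illustrates the difference on a reduced two-column version of the dataset. The names “cleaned” and “result” are illustrative.

```python
import numpy as np
import pandas as pd

# Reduced, illustrative version of the dataset with only the numeric columns.
dataframe = pd.DataFrame({"Roll number": [23, 45, np.nan, 18],
                          "Marks": [57, np.nan, 98, np.nan]}, index=[1, 2, 3, 4])

# Without inplace, dropna() returns a new data frame and leaves the original untouched.
cleaned = dataframe.dropna()
print(len(dataframe), len(cleaned))  # 4 1

# With inplace=True, the original is modified and the call itself returns None.
result = dataframe.dropna(inplace=True)
print(result)          # None
print(len(dataframe))  # 1
```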

Dropping an Entire Row of “NaN” Values

We can pass the “how="all"” argument to the “dropna()” function to drop only those rows in which all the values are “NaN”.

Example

import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", np.nan, "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", np.nan, "Political science"], "Marks": [57, 25, np.nan, np.nan]}

dataframe = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("The original data frame is: -")
print(dataframe)

dataframe.dropna(how= "all", inplace= True)
print("The data frame after dropping the rows: -")
print(dataframe)

Output

The original data frame is: -
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics   25.0
3          NaN          NaN                NaN    NaN
4        Swati         18.0  Political science    NaN
The data frame after dropping the rows: -
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics   25.0
4        Swati         18.0  Political science    NaN

Here, only the 3rd row was dropped as it contained only “NaN” values. The 4th row is kept because just one of its values is missing. We can also apply other conditions for dropping “NaN” values, depending on how the data frame needs to be structured.
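
Two such conditions can be expressed through the “subset” and “thresh” arguments of “dropna()”. The following is a minimal sketch on the same dataset. The names “by_marks” and “at_least_three” are illustrative.

```python
import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", np.nan, "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", np.nan, "Political science"], "Marks": [57, 25, np.nan, np.nan]}
dataframe = pd.DataFrame(dataset, index=[1, 2, 3, 4])

# subset: only look at the listed columns when deciding whether to drop a row.
by_marks = dataframe.dropna(subset=["Marks"])
print(by_marks.index.tolist())  # [1, 2]

# thresh: keep only rows that have at least this many non-NaN values.
at_least_three = dataframe.dropna(thresh=3)
print(at_least_three.index.tolist())  # [1, 2, 4]
```

With “subset”, the 4th row is dropped because its “Marks” value is missing. With “thresh=3”, it is kept because three of its four values are present.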

Using fillna() Function and drop() Function

This is an indirect method of dropping rows with missing values. Let’s assume we don’t know how many “NaN” values are present in a data frame. In such a case, we will create a general program to check each column.

Example

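We used the fillna() function to replace all the “NaN” values with 1. After this, we used the “.index” attribute to retrieve the index labels of the rows in which any column contains 1. Assuming we don’t know which columns contain “NaN” values or how many, we included all the columns in the condition. We then used the drop() function and passed these index labels to drop the rows. Note that this approach works only if 1 does not occur as a genuine value in the data; otherwise, rows without any missing value would be dropped as well.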

import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}

dataframe = pd.DataFrame(dataset, index= [1, 2, 3, 4])
print("The original data frame is: -")
print(dataframe)

dataframe.fillna(1, inplace= True)
index_values = dataframe[(dataframe["Student name"] == 1) | (dataframe["Roll number"] == 1) |
               (dataframe["Major Subject"] == 1) | (dataframe["Marks"] == 1)].index

dataframe.drop(index_values, inplace=True)
print("The data frame after dropping rows: -")
print(dataframe)

Output

The original data frame is: -
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN
The data frame after dropping rows: -
  Student name  Roll number  Major Subject  Marks
1         Ajay         23.0          Maths   57.0
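
The same rows can be located without a placeholder value. As a sketch of an alternative, “isna()” combined with “any(axis=1)” builds the row mask directly, so a genuine value of 1 in the data is not mistaken for a missing one. This swaps the fillna() step for a boolean mask. The names “has_missing” and “cleaned” are illustrative.

```python
import numpy as np
import pandas as pd

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"], "Roll number": [23, 45, np.nan, 18],
           "Major Subject": ["Maths", "Physics", "Arts", "Political science"], "Marks": [57, np.nan, 98, np.nan]}
dataframe = pd.DataFrame(dataset, index=[1, 2, 3, 4])

# True for every row that has at least one missing value, whichever column it is in.
has_missing = dataframe.isna().any(axis=1)

# Collect the index labels of those rows and pass them to drop().
index_values = dataframe[has_missing].index
cleaned = dataframe.drop(index_values)

print(index_values.tolist())  # [2, 3, 4]
print(cleaned.index.tolist())  # [1]
```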

Conclusion

In this article, we discussed the basic operation of dropping rows that contain “NaN” values from a pandas data frame. We prepared a dataset and used the numpy library to include “NaN” values in it. We applied the “dropna()” function, both to generate a new data frame and to modify the existing one, and used “how="all"” to drop only rows that are entirely “NaN”. We also looked at an indirect approach based on the fillna() and drop() functions.

Updated on: 05-May-2023
