How to Use Pandas filter with IQR?


Pandas is an open-source Python library used for data analysis and manipulation. Pandas provides functionality for data cleaning, transformation, and filtering. In large datasets, some extreme values called outliers can modify the data analysis result. In order to identify those outliers, a robust statistical measure called the Interquartile range (IQR) is used. In this article, we will understand how pandas filter with the IQR to identify and handle outliers in the dataset.

Understanding the Interquartile Range (IQR)

Before understanding how to use the Pandas filter with IQR, let’s briefly understand what is Interquartile range(IQR). Quartile divides a dataset into four equal parts, where Q1 represents the 25th percentile and Q3 represents the 75th percentile. The IQR is calculated as the difference between the 75th percentile i.e. Q3 and the 25th percentile i.e. Q1. Simply, IQR is the measure of statistical dispersion, specifically the range between the first quartile(Q1) and the third quartile(Q3) in a dataset.

IQR = Q3 - Q1

Steps for using Pandas Filter with IQR

We will be using the housing dataset from Kaggle which can be downloaded from the Kaggle website. (https://www.kaggle.com/datasets/ashydv/housing-dataset?resource=download). Kaggle is a popular platform for finding and sharing datasets. It hosts a wide range of datasets, including housing-related data. You can search for housing datasets on Kaggle at https://www.kaggle.com/datasets.

Step 1: Importing the Necessary Libraries

The first step is to import the required libraries that will be used for using pandas and numpy methods. In this example, we will import the pandas and numpy library using the import attribute in Python.

import pandas as pd
import numpy as np

Step 2: Loading the Dataset

Now after importing the required library, we need to read our Housing dataset. Here, we will read the CSV file using the read_csv() method of Python.

data = pd.read_csv('housing_data.csv')

Step 3: Exploring the Dataset

Before applying the IQR filter, it's important to explore the dataset and understand its structure. We can use various Pandas functions to gain insights into the data. For example, head() displays the first few rows, info() provides a summary of the dataset, and describe() calculates statistical measures for each column.

print(data.head())
print(data.info())
print(data.describe())

Output

      price  area  bedrooms  bathrooms  stories mainroad guestroom basement  \
0  13300000  7420         4          2        3      yes        no       no   
1  12250000  8960         4          4        4      yes        no       no   
2  12250000  9960         3          2        2      yes        no      yes   
3  12215000  7500         4          2        2      yes        no      yes   
4  11410000  7420         4          1        2      yes       yes      yes   

  hotwaterheating airconditioning  parking prefarea furnishingstatus  
0              no             yes        2      yes        furnished  
1              no             yes        3       no        furnished  
2              no              no        2      yes   semi-furnished  
3              no             yes        3      yes        furnished  
4              no             yes        2       no        furnished  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544


Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


              price          area    bedrooms   bathrooms     stories  \
count  5.450000e+02    545.000000  545.000000  545.000000  545.000000   
mean   4.766729e+06   5150.541284    2.965138    1.286239    1.805505   
std    1.870440e+06   2170.141023    0.738064    0.502470    0.867492   
min    1.750000e+06   1650.000000    1.000000    1.000000    1.000000   
25%    3.430000e+06   3600.000000    2.000000    1.000000    1.000000   
50%    4.340000e+06   4600.000000    3.000000    1.000000    2.000000   
75%    5.740000e+06   6360.000000    3.000000    2.000000    2.000000   
max    1.330000e+07  16200.000000    6.000000    4.000000    4.000000   

          parking  
count  545.000000  
mean     0.693578  
std      0.861586  
min      0.000000  
25%      0.000000  
50%      0.000000  
75%      1.000000  
max      3.000000 

Step 4: Computing the IQR

Once we have an understanding of the dataset, we can calculate the IQR for each numerical column. To do this, we use the quantile() function in Pandas, which calculates the desired quantile for a given dataset. In our case, we need the 25th percentile (Q1) and the 75th percentile (Q3) for each column.

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

Step 5: Applying the IQR filter

In the previous step, we calculated the IQR value for each column, now we can apply the filter to identify outliers. The general rule is that any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier. We can create a boolean mask using logical operators to filter out these values.

outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))

Step 6: Handling the Outliers

Once we have identified the outliers, we can decide how to handle them. There are several options available, depending on the nature of the dataset and the analysis goals. Some common approaches are as follows:

Removing the outliers: In this approach, we simply remove the rows containing the outliers from the dataset. This approach is useful when the outliers are unlikely to occur in future data points or when the outliers are due to data entry errors or measurement errors. We can use the drop() function in Pandas to remove the rows with outliers.

clean_data = data[~outliers.any(axis=1)]

Imputing the Outliers: Instead of removing the outliers, we can replace them with more reasonable values. This approach is useful when we believe the outliers are valid data points but still want to mitigate their impact. Imputation methods include replacing the outliers with the mean, median, or custom value.

data[outliers] = data.mean()  # Replace outliers with the mean

Capping the outliers: In some cases, it may be appropriate to cap the outliers by replacing them with the nearest non-outlier value. This approach maintains the overall distribution while reducing the influence of extreme values.

data[data < (Q1 - 1.5 * IQR)] = Q1 - 1.5 * IQR  # Cap low outliers
data[data > (Q3 + 1.5 * IQR)] = Q3 + 1.5 * IQR  # Cap high outliers

In our above example, we will simply remove the row containing the outlier. You can use a different approach for outlier handling depending on the type of your dataset. The full code for the above Housing dataset can be found below:

import pandas as pd
import numpy as np

# Step 2: Loading the Dataset
data = pd.read_csv('Housing.csv')

# Step 3: Exploring the Dataset
#print(data.head())
#print(data.info())
#print(data.describe())

# Step 4: Computing the IQR
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Step 5: Applying the IQR Filter
outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))

# Step 6: Handling the Outliers
clean_data = data[~outliers.any(axis=1)]

# Step 7: Conclusion
print(clean_data.head())

Output

     price  area  bedrooms  bathrooms  stories mainroad guestroom basement  \
15  9100000  6000         4          1        2      yes        no      yes   
16  9100000  6600         4          2        2      yes       yes      yes   
18  8890000  4600         3          2        2      yes       yes       no   
19  8855000  6420         3          2        2      yes        no       no   
20  8750000  4320         3          1        2      yes        no      yes   

   hotwaterheating airconditioning  parking prefarea furnishingstatus  
15              no              no        2       no   semi-furnished  
16              no             yes        1      yes      unfurnished  
18              no             yes        2       no        furnished  
19              no             yes        1      yes   semi-furnished  
20             yes              no        2       no   semi-furnished 

Conclusion

In this article, we discussed how we can use the Pandas filter with IQR to determine and handle the outliers in a dataset. We used the Housing dataset from Kaggle to perform the filtering operation on the dataset. By calculating the IQR and applying the filter, we can easily identify values that deviate significantly from the central distribution. Depending on the analysis goals, we can either remove the outliers, impute them with reasonable values, or cap them to reduce their impact.

Updated on: 13-Oct-2023

405 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements