- Data Structure
- Networking
- RDBMS
- Operating System
- Java
- MS Excel
- iOS
- HTML
- CSS
- Android
- Python
- C Programming
- C++
- C#
- MongoDB
- MySQL
- Javascript
- PHP
- Physics
- Chemistry
- Biology
- Mathematics
- English
- Economics
- Psychology
- Social Studies
- Fashion Studies
- Legal Studies
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to Use Pandas filter with IQR?
Pandas is an open-source Python library used for data analysis and manipulation. Pandas provides functionality for data cleaning, transformation, and filtering. In large datasets, some extreme values called outliers can modify the data analysis result. In order to identify those outliers, a robust statistical measure called the Interquartile range (IQR) is used. In this article, we will understand how pandas filter with the IQR to identify and handle outliers in the dataset.
Understanding the Interquartile Range (IQR)
Before understanding how to use the Pandas filter with IQR, let’s briefly understand what is Interquartile range(IQR). Quartile divides a dataset into four equal parts, where Q1 represents the 25th percentile and Q3 represents the 75th percentile. The IQR is calculated as the difference between the 75th percentile i.e. Q3 and the 25th percentile i.e. Q1. Simply, IQR is the measure of statistical dispersion, specifically the range between the first quartile(Q1) and the third quartile(Q3) in a dataset.
IQR = Q3 - Q1
Steps for using Pandas Filter with IQR
We will be using the housing dataset from Kaggle which can be downloaded from the Kaggle website. (https://www.kaggle.com/datasets/ashydv/housing-dataset?resource=download). Kaggle is a popular platform for finding and sharing datasets. It hosts a wide range of datasets, including housing-related data. You can search for housing datasets on Kaggle at https://www.kaggle.com/datasets.
Step 1: Importing the Necessary Libraries
The first step is to import the required libraries that will be used for using pandas and numpy methods. In this example, we will import the pandas and numpy library using the import attribute in Python.
import pandas as pd import numpy as np
Step 2: Loading the Dataset
Now after importing the required library, we need to read our Housing dataset. Here, we will read the CSV file using the read_csv() method of Python.
data = pd.read_csv('housing_data.csv')
Step 3: Exploring the Dataset
Before applying the IQR filter, it's important to explore the dataset and understand its structure. We can use various Pandas functions to gain insights into the data. For example, head() displays the first few rows, info() provides a summary of the dataset, and describe() calculates statistical measures for each column.
print(data.head()) print(data.info()) print(data.describe())
Output
price area bedrooms bathrooms stories mainroad guestroom basement \ 0 13300000 7420 4 2 3 yes no no 1 12250000 8960 4 4 4 yes no no 2 12250000 9960 3 2 2 yes no yes 3 12215000 7500 4 2 2 yes no yes 4 11410000 7420 4 1 2 yes yes yes hotwaterheating airconditioning parking prefarea furnishingstatus 0 no yes 2 yes furnished 1 no yes 3 no furnished 2 no no 2 yes semi-furnished 3 no yes 3 yes furnished 4 no yes 2 no furnished <class 'pandas.core.frame.DataFrame'> RangeIndex: 545 entries, 0 to 544 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 545 non-null int64 1 area 545 non-null int64 2 bedrooms 545 non-null int64 3 bathrooms 545 non-null int64 4 stories 545 non-null int64 5 mainroad 545 non-null object 6 guestroom 545 non-null object 7 basement 545 non-null object 8 hotwaterheating 545 non-null object 9 airconditioning 545 non-null object 10 parking 545 non-null int64 11 prefarea 545 non-null object 12 furnishingstatus 545 non-null object dtypes: int64(6), object(7) memory usage: 55.5+ KB price area bedrooms bathrooms stories \ count 5.450000e+02 545.000000 545.000000 545.000000 545.000000 mean 4.766729e+06 5150.541284 2.965138 1.286239 1.805505 std 1.870440e+06 2170.141023 0.738064 0.502470 0.867492 min 1.750000e+06 1650.000000 1.000000 1.000000 1.000000 25% 3.430000e+06 3600.000000 2.000000 1.000000 1.000000 50% 4.340000e+06 4600.000000 3.000000 1.000000 2.000000 75% 5.740000e+06 6360.000000 3.000000 2.000000 2.000000 max 1.330000e+07 16200.000000 6.000000 4.000000 4.000000 parking count 545.000000 mean 0.693578 std 0.861586 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 3.000000
Step 4: Computing the IQR
Once we have an understanding of the dataset, we can calculate the IQR for each numerical column. To do this, we use the quantile() function in Pandas, which calculates the desired quantile for a given dataset. In our case, we need the 25th percentile (Q1) and the 75th percentile (Q3) for each column.
Q1 = data.quantile(0.25) Q3 = data.quantile(0.75) IQR = Q3 - Q1
Step 5: Applying the IQR filter
In the previous step, we calculated the IQR value for each column, now we can apply the filter to identify outliers. The general rule is that any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier. We can create a boolean mask using logical operators to filter out these values.
outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))
Step 6: Handling the Outliers
Once we have identified the outliers, we can decide how to handle them. There are several options available, depending on the nature of the dataset and the analysis goals. Some common approaches are as follows:
Removing the outliers: In this approach, we simply remove the rows containing the outliers from the dataset. This approach is useful when the outliers are unlikely to occur in future data points or when the outliers are due to data entry errors or measurement errors. We can use the drop() function in Pandas to remove the rows with outliers.
clean_data = data[~outliers.any(axis=1)]
Imputing the Outliers: Instead of removing the outliers, we can replace them with more reasonable values. This approach is useful when we believe the outliers are valid data points but still want to mitigate their impact. Imputation methods include replacing the outliers with the mean, median, or custom value.
data[outliers] = data.mean() # Replace outliers with the mean
Capping the outliers: In some cases, it may be appropriate to cap the outliers by replacing them with the nearest non-outlier value. This approach maintains the overall distribution while reducing the influence of extreme values.
data[data < (Q1 - 1.5 * IQR)] = Q1 - 1.5 * IQR # Cap low outliers data[data > (Q3 + 1.5 * IQR)] = Q3 + 1.5 * IQR # Cap high outliers
In our above example, we will simply remove the row containing the outlier. You can use a different approach for outlier handling depending on the type of your dataset. The full code for the above Housing dataset can be found below:
import pandas as pd import numpy as np # Step 2: Loading the Dataset data = pd.read_csv('Housing.csv') # Step 3: Exploring the Dataset #print(data.head()) #print(data.info()) #print(data.describe()) # Step 4: Computing the IQR Q1 = data.quantile(0.25) Q3 = data.quantile(0.75) IQR = Q3 - Q1 # Step 5: Applying the IQR Filter outliers = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR)) # Step 6: Handling the Outliers clean_data = data[~outliers.any(axis=1)] # Step 7: Conclusion print(clean_data.head())
Output
price area bedrooms bathrooms stories mainroad guestroom basement \ 15 9100000 6000 4 1 2 yes no yes 16 9100000 6600 4 2 2 yes yes yes 18 8890000 4600 3 2 2 yes yes no 19 8855000 6420 3 2 2 yes no no 20 8750000 4320 3 1 2 yes no yes hotwaterheating airconditioning parking prefarea furnishingstatus 15 no no 2 no semi-furnished 16 no yes 1 yes unfurnished 18 no yes 2 no furnished 19 no yes 1 yes semi-furnished 20 yes no 2 no semi-furnished
Conclusion
In this article, we discussed how we can use the Pandas filter with IQR to determine and handle the outliers in a dataset. We used the Housing dataset from Kaggle to perform the filtering operation on the dataset. By calculating the IQR and applying the filter, we can easily identify values that deviate significantly from the central distribution. Depending on the analysis goals, we can either remove the outliers, impute them with reasonable values, or cap them to reduce their impact.