Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to Use Pandas filter with IQR?
Pandas is an open-source Python library used for data analysis and manipulation. In large datasets, some extreme values called outliers can modify the data analysis result. The Interquartile Range (IQR) is a robust statistical measure used to identify and handle these outliers.
Understanding the Interquartile Range (IQR)
The IQR divides a dataset into quartiles, where Q1 represents the 25th percentile and Q3 represents the 75th percentile. The IQR is calculated as the difference between Q3 and Q1 ?
# IQR Formula IQR = Q3 - Q1
Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.
Creating Sample Data
Let's create sample housing data to demonstrate IQR filtering ?
import pandas as pd
import numpy as np
# Create sample housing data
np.random.seed(42)
data = {
'price': np.random.normal(5000000, 1500000, 100).astype(int),
'area': np.random.normal(5000, 2000, 100).astype(int),
'bedrooms': np.random.randint(1, 6, 100),
'bathrooms': np.random.randint(1, 4, 100)
}
# Add some outliers
data['price'][0] = 15000000 # High outlier
data['area'][1] = 15000 # High outlier
data['price'][2] = 1000000 # Low outlier
df = pd.DataFrame(data)
print("Original dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
Original dataset shape: (100, 4)
First 5 rows:
price area bedrooms bathrooms
0 15000000 4967 4 1
1 4434262 15000 2 1
2 1000000 6951 4 3
3 6396737 1004 2 1
4 4836637 7395 2 2
Computing the IQR
Calculate the quartiles and IQR for numerical columns ?
# Calculate quartiles for numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns
print("Numerical columns:", list(numerical_cols))
Q1 = df[numerical_cols].quantile(0.25)
Q3 = df[numerical_cols].quantile(0.75)
IQR = Q3 - Q1
print("\nQ1 (25th percentile):")
print(Q1)
print("\nQ3 (75th percentile):")
print(Q3)
print("\nIQR:")
print(IQR)
Numerical columns: ['price', 'area', 'bedrooms', 'bathrooms'] Q1 (25th percentile): price 3644906.50 area 3616.25 bedrooms 2.00 bathrooms 1.00 Name: 0.25, dtype: float64 Q3 (75th percentile): price 6181262.25 area 6244.50 bedrooms 4.00 bathrooms 2.00 Name: 0.75, dtype: float64 IQR: price 2536355.75 area 2628.25 bedrooms 2.00 bathrooms 1.00 Name: 0.5, dtype: float64
Identifying Outliers
Create boolean masks to identify outliers using the IQR method ?
# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("Lower bounds:")
print(lower_bound)
print("\nUpper bounds:")
print(upper_bound)
# Identify outliers
outliers = (df[numerical_cols] < lower_bound) | (df[numerical_cols] > upper_bound)
print(f"\nNumber of outliers per column:")
print(outliers.sum())
# Find rows with any outliers
outlier_rows = outliers.any(axis=1)
print(f"\nTotal rows with outliers: {outlier_rows.sum()}")
print("Outlier rows:")
print(df[outlier_rows][['price', 'area']])
Lower bounds:
price -161627.125
area -327.125
bedrooms -1.000
bathrooms -0.500
Name: 0.25, dtype: float64
Upper bounds:
price 9985796.375
area 9188.875
bedrooms 7.000
bathrooms 3.500
Name: 0.75, dtype: float64
Number of outliers per column:
price 2
area 1
bedrooms 0
bathrooms 0
dtype: int64
Total rows with outliers: 3
Outlier rows:
price area
0 15000000 4967
1 4434262 15000
3 6396737 1004
Method 1: Removing Outliers
Remove rows containing outliers from the dataset ?
# Remove rows with outliers
clean_data = df[~outlier_rows].copy()
print("Original dataset shape:", df.shape)
print("Cleaned dataset shape:", clean_data.shape)
print(f"Removed {df.shape[0] - clean_data.shape[0]} rows")
print("\nFirst 5 rows after cleaning:")
print(clean_data.head())
Original dataset shape: (100, 4)
Cleaned dataset shape: (97, 3)
Removed 3 rows
First 5 rows after cleaning:
price area bedrooms bathrooms
2 1000000 6951 4 3
4 4836637 7395 2 2
5 8917445 6405 4 2
6 1946615 4326 3 1
7 4019269 1775 2 2
Method 2: Capping Outliers
Replace outliers with boundary values instead of removing them ?
# Create a copy for capping
capped_data = df.copy()
# Cap outliers for each numerical column
for col in numerical_cols:
# Cap low outliers
capped_data.loc[capped_data[col] < lower_bound[col], col] = lower_bound[col]
# Cap high outliers
capped_data.loc[capped_data[col] > upper_bound[col], col] = upper_bound[col]
print("Before capping - outlier values:")
print("Max price:", df['price'].max())
print("Max area:", df['area'].max())
print("\nAfter capping - outlier values:")
print("Max price:", capped_data['price'].max())
print("Max area:", capped_data['area'].max())
Before capping - outlier values: Max price: 15000000 Max area: 15000 After capping - outlier values: Max price: 9985796 Max area: 9188
Comparison of Methods
| Method | Rows Kept | Data Integrity | Best For |
|---|---|---|---|
| Remove Outliers | 97/100 | High (pure data) | Data entry errors |
| Cap Outliers | 100/100 | Medium (modified) | Valid extreme values |
Complete Example
import pandas as pd
import numpy as np
def filter_outliers_iqr(df, method='remove'):
"""
Filter outliers using IQR method
Parameters:
df: DataFrame to filter
method: 'remove' or 'cap'
Returns:
Filtered DataFrame
"""
# Select only numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns
# Calculate IQR
Q1 = df[numerical_cols].quantile(0.25)
Q3 = df[numerical_cols].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
if method == 'remove':
# Remove outliers
outliers = (df[numerical_cols] < lower_bound) | (df[numerical_cols] > upper_bound)
return df[~outliers.any(axis=1)].copy()
elif method == 'cap':
# Cap outliers
df_capped = df.copy()
for col in numerical_cols:
df_capped.loc[df_capped[col] < lower_bound[col], col] = lower_bound[col]
df_capped.loc[df_capped[col] > upper_bound[col], col] = upper_bound[col]
return df_capped
# Test the function
print("Original shape:", df.shape)
cleaned_df = filter_outliers_iqr(df, method='remove')
print("After removing outliers:", cleaned_df.shape)
capped_df = filter_outliers_iqr(df, method='cap')
print("After capping outliers:", capped_df.shape)
Original shape: (100, 4) After removing outliers: (97, 4) After capping outliers: (100, 4)
Conclusion
Using Pandas with IQR filtering is an effective way to identify and handle outliers in datasets. The IQR method is robust against extreme values and works well with skewed distributions. Choose removal for data errors and capping for valid extreme values.
