Article Categories

Selected Reading

How to Use Pandas filter with IQR?

Python Pandas Server Side Programming Programming

Pandas is an open-source Python library used for data analysis and manipulation. In large datasets, some extreme values called outliers can modify the data analysis result. The Interquartile Range (IQR) is a robust statistical measure used to identify and handle these outliers.

Understanding the Interquartile Range (IQR)

The IQR divides a dataset into quartiles, where Q1 represents the 25th percentile and Q3 represents the 75th percentile. The IQR is calculated as the difference between Q3 and Q1 ?

# IQR Formula
IQR = Q3 - Q1

Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.

Creating Sample Data

Let's create sample housing data to demonstrate IQR filtering ?

import pandas as pd
import numpy as np

# Create sample housing data
np.random.seed(42)
data = {
    'price': np.random.normal(5000000, 1500000, 100).astype(int),
    'area': np.random.normal(5000, 2000, 100).astype(int),
    'bedrooms': np.random.randint(1, 6, 100),
    'bathrooms': np.random.randint(1, 4, 100)
}

# Add some outliers
data['price'][0] = 15000000  # High outlier
data['area'][1] = 15000      # High outlier
data['price'][2] = 1000000   # Low outlier

df = pd.DataFrame(data)
print("Original dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

Original dataset shape: (100, 4)

First 5 rows:
      price  area  bedrooms  bathrooms
0  15000000  4967         4          1
1   4434262 15000         2          1
2   1000000  6951         4          3
3   6396737  1004         2          1
4   4836637  7395         2          2

Computing the IQR

Calculate the quartiles and IQR for numerical columns ?

# Calculate quartiles for numerical columns only
numerical_cols = df.select_dtypes(include=[np.number]).columns
print("Numerical columns:", list(numerical_cols))

Q1 = df[numerical_cols].quantile(0.25)
Q3 = df[numerical_cols].quantile(0.75)
IQR = Q3 - Q1

print("\nQ1 (25th percentile):")
print(Q1)
print("\nQ3 (75th percentile):")
print(Q3)
print("\nIQR:")
print(IQR)

Numerical columns: ['price', 'area', 'bedrooms', 'bathrooms']

Q1 (25th percentile):
price        3644906.50
area         3616.25
bedrooms        2.00
bathrooms       1.00
Name: 0.25, dtype: float64

Q3 (75th percentile):
price        6181262.25
area         6244.50
bedrooms        4.00
bathrooms       2.00
Name: 0.75, dtype: float64

IQR:
price        2536355.75
area         2628.25
bedrooms        2.00
bathrooms       1.00
Name: 0.5, dtype: float64

Identifying Outliers

Create boolean masks to identify outliers using the IQR method ?

# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("Lower bounds:")
print(lower_bound)
print("\nUpper bounds:")
print(upper_bound)

# Identify outliers
outliers = (df[numerical_cols] < lower_bound) | (df[numerical_cols] > upper_bound)
print(f"\nNumber of outliers per column:")
print(outliers.sum())

# Find rows with any outliers
outlier_rows = outliers.any(axis=1)
print(f"\nTotal rows with outliers: {outlier_rows.sum()}")
print("Outlier rows:")
print(df[outlier_rows][['price', 'area']])

Lower bounds:
price        -161627.125
area           -327.125
bedrooms        -1.000
bathrooms       -0.500
Name: 0.25, dtype: float64

Upper bounds:
price        9985796.375
area         9188.875
bedrooms         7.000
bathrooms        3.500
Name: 0.75, dtype: float64

Number of outliers per column:
price       2
area        1
bedrooms    0
bathrooms   0
dtype: int64

Total rows with outliers: 3
Outlier rows:
      price  area
0  15000000  4967
1   4434262 15000
3   6396737  1004

Method 1: Removing Outliers

Remove rows containing outliers from the dataset ?

# Remove rows with outliers
clean_data = df[~outlier_rows].copy()

print("Original dataset shape:", df.shape)
print("Cleaned dataset shape:", clean_data.shape)
print(f"Removed {df.shape[0] - clean_data.shape[0]} rows")

print("\nFirst 5 rows after cleaning:")
print(clean_data.head())

Original dataset shape: (100, 4)
Cleaned dataset shape: (97, 3)
Removed 3 rows

First 5 rows after cleaning:
      price  area  bedrooms  bathrooms
2   1000000  6951         4          3
4   4836637  7395         2          2
5   8917445  6405         4          2
6   1946615  4326         3          1
7   4019269  1775         2          2

Method 2: Capping Outliers

Replace outliers with boundary values instead of removing them ?

# Create a copy for capping
capped_data = df.copy()

# Cap outliers for each numerical column
for col in numerical_cols:
    # Cap low outliers
    capped_data.loc[capped_data[col] < lower_bound[col], col] = lower_bound[col]
    # Cap high outliers  
    capped_data.loc[capped_data[col] > upper_bound[col], col] = upper_bound[col]

print("Before capping - outlier values:")
print("Max price:", df['price'].max())
print("Max area:", df['area'].max())

print("\nAfter capping - outlier values:")
print("Max price:", capped_data['price'].max())
print("Max area:", capped_data['area'].max())

Before capping - outlier values:
Max price: 15000000
Max area: 15000

After capping - outlier values:
Max price: 9985796
Max area: 9188

Comparison of Methods

Method	Rows Kept	Data Integrity	Best For
Remove Outliers	97/100	High (pure data)	Data entry errors
Cap Outliers	100/100	Medium (modified)	Valid extreme values

Complete Example

import pandas as pd
import numpy as np

def filter_outliers_iqr(df, method='remove'):
    """
    Filter outliers using IQR method
    
    Parameters:
    df: DataFrame to filter
    method: 'remove' or 'cap'
    
    Returns:
    Filtered DataFrame
    """
    # Select only numerical columns
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    
    # Calculate IQR
    Q1 = df[numerical_cols].quantile(0.25)
    Q3 = df[numerical_cols].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    if method == 'remove':
        # Remove outliers
        outliers = (df[numerical_cols] < lower_bound) | (df[numerical_cols] > upper_bound)
        return df[~outliers.any(axis=1)].copy()
    
    elif method == 'cap':
        # Cap outliers
        df_capped = df.copy()
        for col in numerical_cols:
            df_capped.loc[df_capped[col] < lower_bound[col], col] = lower_bound[col]
            df_capped.loc[df_capped[col] > upper_bound[col], col] = upper_bound[col]
        return df_capped

# Test the function
print("Original shape:", df.shape)
cleaned_df = filter_outliers_iqr(df, method='remove')
print("After removing outliers:", cleaned_df.shape)

capped_df = filter_outliers_iqr(df, method='cap')
print("After capping outliers:", capped_df.shape)

Original shape: (100, 4)
After removing outliers: (97, 4)
After capping outliers: (100, 4)

Conclusion

Using Pandas with IQR filtering is an effective way to identify and handle outliers in datasets. The IQR method is robust against extreme values and works well with skewed distributions. Choose removal for data errors and capping for valid extreme values.

Rohan Singh

Updated on: 2026-03-27T15:16:07+05:30

3K+ Views

Previous Next