How to handle missing data using seaborn?

Seaborn is primarily a visualization library and has no dedicated methods for handling missing data. It works directly with pandas DataFrames, however, and pandas provides tools to detect, remove, and fill missing values.

Combining the two gives a simple workflow: clean the data with pandas, then visualize the result with Seaborn to gain insights from the dataset.

Import Required Libraries

First, we need to import the necessary libraries in our Python environment:

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Create a Dataset with Missing Values

Let's create a sample dataset that contains missing values to demonstrate the handling techniques:

import pandas as pd
import numpy as np

# Create dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT']
}

df = pd.DataFrame(data)
print(df)

      Name   Age   Salary Department
0    Alice  25.0  50000.0         HR
1      Bob   NaN  60000.0         IT
2  Charlie  35.0      NaN        NaN
3    David  28.0  75000.0    Finance
4      Eve   NaN  65000.0         IT

Identify Missing Data

Pandas provides methods to identify missing data in a DataFrame. The isnull() method returns a DataFrame of booleans that is True wherever a value is missing:

import pandas as pd
import numpy as np

# Create dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT']
}

df = pd.DataFrame(data)

# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

print("\nDetailed missing value information:")
# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

Missing values per column:
Name          0
Age           2
Salary        1
Department    1
dtype: int64

Detailed missing value information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        5 non-null      object 
 1   Age         3 non-null      float64
 2   Salary      4 non-null      float64
 3   Department  4 non-null      object 
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes
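
Counts alone can be hard to judge on larger datasets. As a minimal sketch on the same sample data, the share of missing values per column and the affected rows can be listed as follows. The variable names missing_pct and rows_with_missing are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT'],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column
missing_pct = df.isnull().mean() * 100

# Select only the rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_missing)
```

The percentage view is often a quick way to decide whether dropping or imputing is the more reasonable option for a given column.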

Handle Missing Data

Method 1: Removing Missing Data

If missing data is minimal, we can remove rows or columns using dropna():

import pandas as pd
import numpy as np

# Create dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT']
}

df = pd.DataFrame(data)

# Drop rows with any missing values
df_drop_rows = df.dropna()
print("After dropping rows:")
print(df_drop_rows)

print("\nOriginal shape:", df.shape)
print("After dropping rows shape:", df_drop_rows.shape)

After dropping rows:
    Name   Age   Salary Department
0  Alice  25.0  50000.0         HR
3  David  28.0  75000.0    Finance

Original shape: (5, 4)
After dropping rows shape: (2, 4)
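
Dropping every incomplete row removed three of the five records here. dropna() also accepts parameters that make the removal more selective. The sketch below uses the subset, thresh, and axis parameters on the same sample data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT'],
})

# Drop rows only when 'Age' is missing; gaps in other columns are ignored
df_age_known = df.dropna(subset=['Age'])

# Keep rows that have at least 3 non-missing values
df_min_three = df.dropna(thresh=3)

# Drop columns (instead of rows) that contain any missing value
df_full_cols = df.dropna(axis=1)

print(df_age_known)
print(df_min_three)
print(df_full_cols)
```

Which variant fits depends on which columns the later analysis actually needs.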

Method 2: Imputing Missing Data

When a larger share of the data is missing, we can instead fill the gaps with the mean, median, mode, or a custom value:

import pandas as pd
import numpy as np

# Create dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT']
}

df = pd.DataFrame(data)

# Create a copy for imputation
df_imputed = df.copy()

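# Fill missing Age with mean
# (assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas and may not modify the DataFrame)
df_imputed['Age'] = df_imputed['Age'].fillna(df_imputed['Age'].mean())

# Fill missing Salary with median
df_imputed['Salary'] = df_imputed['Salary'].fillna(df_imputed['Salary'].median())

# Fill missing Department with mode
df_imputed['Department'] = df_imputed['Department'].fillna(df_imputed['Department'].mode()[0])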

print("After imputation:")
print(df_imputed)

After imputation:
      Name        Age   Salary Department
0    Alice  25.000000  50000.0         HR
1      Bob  29.333333  60000.0         IT
2  Charlie  35.000000  62500.0         IT
3    David  28.000000  75000.0    Finance
4      Eve  29.333333  65000.0         IT
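
The same imputation can also be written as a single fillna() call that takes a dictionary mapping column names to fill values. The sketch below additionally shows a custom value: the 'Unknown' label for Department is an assumption chosen for illustration, not something taken from the example above.

```python
import numpy as np
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT'],
})

# One fill value per column; columns not listed are left untouched
fill_values = {
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'Department': 'Unknown',  # hypothetical placeholder label
}

# fillna returns a new DataFrame, so the original df keeps its gaps
df_filled = df.fillna(fill_values)
print(df_filled)
```

An explicit placeholder category keeps the fact that a value was missing visible in later plots, whereas mode imputation hides it.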

Visualize Data with Missing Values Using Seaborn

Seaborn can help us visualize missing data patterns and compare before/after cleaning:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create dataset with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT']
}

df = pd.DataFrame(data)

# Create imputed version
df_imputed = df.copy()
# Assign the filled columns back instead of using inplace=True on a selection
df_imputed['Age'] = df_imputed['Age'].fillna(df_imputed['Age'].mean())
df_imputed['Salary'] = df_imputed['Salary'].fillna(df_imputed['Salary'].median())
df_imputed['Department'] = df_imputed['Department'].fillna(df_imputed['Department'].mode()[0])

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot missing data heatmap
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, 
            cmap='viridis', ax=axes[0])
axes[0].set_title('Missing Data Pattern')

# Plot cleaned data
sns.scatterplot(data=df_imputed, x='Age', y='Salary', 
                hue='Department', ax=axes[1])
axes[1].set_title('Cleaned Data Visualization')

plt.tight_layout()
plt.show()
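
A heatmap shows where values are missing; a bar chart of missing counts per column can be easier to read when there are many rows. The following is one possible sketch, assuming the same sample data. The missing_counts table and its column names ('column', 'missing') are illustrative choices.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same sample dataset as above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'Salary': [50000, 60000, np.nan, 75000, 65000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT'],
})

# Build a small long-format table: one row per column with its missing count
missing_counts = (
    df.isnull().sum()
      .rename_axis('column')
      .reset_index(name='missing')
)

# Plot the counts as bars, one bar per DataFrame column
fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=missing_counts, x='column', y='missing', ax=ax)
ax.set_title('Missing Values per Column')
ax.set_ylabel('Number of missing values')

plt.tight_layout()
plt.show()
```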

Comparison of Approaches

Method	Pros	Cons	Best For
`dropna()`	Simple, adds no artificial values	Loses information, reduces dataset size	Small amount of missing data
Mean/Median imputation	Preserves dataset size	May introduce bias	Numerical data with random missingness
Mode imputation	Works for categorical data	May not represent true distribution	Categorical data
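
To make the "may introduce bias" entry concrete, the sketch below compares the Age column before and after mean imputation. It is only an illustration on the five sample values: the mean stays the same, but the spread shrinks because the filled values all sit exactly at the mean.

```python
import numpy as np
import pandas as pd

# Age column from the sample dataset
age = pd.Series([25, np.nan, 35, 28, np.nan], name='Age')

# Mean imputation: every gap receives the mean of the observed values
age_imputed = age.fillna(age.mean())

# pandas skips NaN by default, so these statistics use observed values only
print("Mean before:", age.mean(), "after:", age_imputed.mean())
print("Std before: ", age.std(), "after:", age_imputed.std())
```

A smaller standard deviation after imputation is one reason plots of imputed data can look more tightly clustered than the underlying population really is.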

Conclusion

While Seaborn doesn't directly handle missing data, combining pandas data cleaning with Seaborn visualization creates a powerful workflow. Choose dropna() for minimal missing data or imputation methods for preserving dataset size when missing data is significant.

Niharika Aitam

Updated on: 2026-03-27T10:52:43+05:30