How to Handle Missing Values of Categorical Variables in Python?

Missing values are a common occurrence in real-world datasets, and handling them appropriately is crucial for accurate data analysis and modeling. When dealing with categorical variables in Python, there are several approaches to address missing values. In this article, we will explore two practical methods for handling missing values of categorical variables, providing a step-by-step algorithm for each approach.

Syntax

Let's familiarize ourselves with the syntax of the methods we will be using:

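# Syntax for filling missing values using fillna
# (assign the result back: calling fillna(..., inplace=True) on a selected
# column raises a FutureWarning in pandas 2.2 and does not update the
# DataFrame when Copy-on-Write is enabled)
dataframe['column_name'] = dataframe['column_name'].fillna(value)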

# Syntax for mode calculation
mode_value = dataframe['column_name'].mode()[0]
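
One detail of the syntax above is worth seeing in action: mode() returns a Series rather than a single value, because several categories can tie for the highest count. The sketch below is a minimal illustration; the colors and empty Series are made-up data, not part of this article's dataset.

```python
import pandas as pd

# Hypothetical example data: 'blue' and 'red' both appear twice
colors = pd.Series(['red', 'blue', 'red', 'blue', None])

modes = colors.mode()      # missing values are ignored by default
print(list(modes))         # ['blue', 'red'] (tied modes come back sorted)

first_mode = modes[0]      # [0] picks the first of the tied values
print(first_mode)          # blue

# A column with no observed values has no mode, so check before indexing
empty = pd.Series([None, None], dtype=object)
print(empty.mode().empty)  # True
```

When a tie exists, [0] silently picks the alphabetically first category, so it can be worth checking len(modes) before relying on it.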

Algorithm

Step 1: Import necessary libraries (pandas, numpy)
Step 2: Load or create the dataset
Step 3: Identify missing values using isnull()
Step 4: Choose handling method (mode or random sampling)
Step 5: Apply the chosen method to fill missing values
Step 6: Verify missing values are handled
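
The steps above can be sketched as a single function. This is only an illustration under stated assumptions: fill_missing_categorical, its method and seed parameters, and the small DataFrame are hypothetical names and data introduced here, not part of pandas or of the examples that follow.

```python
import numpy as np
import pandas as pd

def fill_missing_categorical(df, column, method="mode", seed=None):
    """Hypothetical helper: return a copy of df with `column` imputed."""
    result = df.copy()
    missing = result[column].isnull()        # Step 3: identify missing values
    observed = result[column].dropna()
    if not missing.any() or observed.empty:
        return result                        # nothing to fill, or nothing to fill with

    if method == "mode":                     # Steps 4 and 5: choose and apply
        fill = observed.mode()[0]
    elif method == "random":
        rng = np.random.default_rng(seed)
        categories = observed.drop_duplicates().to_numpy()
        fill = rng.choice(categories, size=int(missing.sum()))
    else:
        raise ValueError(f"unknown method: {method}")

    result.loc[missing, column] = fill
    assert result[column].notnull().all()    # Step 6: verify
    return result

# Step 2: a small made-up dataset
df = pd.DataFrame({"Gender": ["Male", "Female", "Male", None, "Male"]})
filled = fill_missing_categorical(df, "Gender", method="mode")
print(filled["Gender"].tolist())
# ['Male', 'Female', 'Male', 'Male', 'Male']
```

Returning a copy keeps the original DataFrame untouched, which makes it easy to compare both methods on the same input.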

Sample Dataset

Let's create a sample dataset with missing categorical values:

import pandas as pd
import numpy as np

# Create sample data with missing values
data = {
    'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
    'Age': [25, 30, 35, 27, 22],
    'Gender': ['Male', 'Female', 'Male', None, 'Male'],
    'Country': ['USA', 'Canada', None, 'UK', 'Germany']
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\nMissing values:")
print(df.isnull().sum())

Original Dataset:
    Name  Age  Gender  Country
0   John   25    Male      USA
1  Alice   30  Female   Canada
2    Bob   35    Male     None
3   Jane   27    None       UK
4   Mike   22    Male  Germany

Missing values:
Name       0
Age        0
Gender     1
Country    1
dtype: int64

Method 1: Mode Imputation

Mode imputation fills missing values with the most frequent category in the column:

import pandas as pd
import numpy as np

# Create sample data
data = {
    'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
    'Gender': ['Male', 'Female', 'Male', None, 'Male'],
    'Country': ['USA', 'Canada', None, 'UK', 'Germany']
}

df = pd.DataFrame(data)

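# Handle Gender column using mode imputation
# (assign the result back instead of using inplace=True on a selected column)
gender_mode = df['Gender'].mode()[0]
df['Gender'] = df['Gender'].fillna(gender_mode)

# Handle Country column using mode imputation
# (every country appears once, so mode()[0] is the alphabetically first one)
country_mode = df['Country'].mode()[0]
df['Country'] = df['Country'].fillna(country_mode)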

print("After Mode Imputation:")
print(df)
print("\nMissing values after imputation:")
print(df.isnull().sum())

After Mode Imputation:
    Name  Gender  Country
0   John    Male      USA
1  Alice  Female   Canada
2    Bob    Male   Canada
3   Jane    Male       UK
4   Mike    Male  Germany

Missing values after imputation:
Name       0
Gender     0
Country    0
dtype: int64

How Mode Imputation Works

Mode imputation works by:

Calculating the most frequent category using mode()[0]
Replacing all missing values with this most frequent value
Keeping the most frequent category as the mode, while increasing its share of the column
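
To see what mode imputation does to category shares, the proportions can be compared before and after filling. A minimal sketch using the Gender values from the sample data; the variable names before, filled and after are illustrative.

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Male', None, 'Male'])

# Share of each category among the observed (non-missing) values
before = gender.value_counts(normalize=True)

# Fill the gap with the mode, then measure the shares again
filled = gender.fillna(gender.mode()[0])
after = filled.value_counts(normalize=True)

print(before['Male'], after['Male'])   # 0.75 0.8
```

With one missing value out of five, the shift is small; the more values are missing, the more the most frequent category grows.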

Method 2: Random Sampling

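Random sampling replaces each missing value with a category drawn at random from the column's existing unique values. Because the draw is random and no seed is set, the filled values can differ between runs: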

import pandas as pd
import numpy as np

# Create sample data
data = {
    'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike', 'Sarah'],
    'Gender': ['Male', 'Female', 'Male', None, 'Male', None],
    'Country': ['USA', 'Canada', None, 'UK', 'Germany', None]
}

df = pd.DataFrame(data)
print("Original data:")
print(df)

# Handle Gender using random sampling
gender_missing_indices = df[df['Gender'].isnull()].index
gender_unique_values = df['Gender'].dropna().unique()

# Fill missing Gender values with random sampling
for idx in gender_missing_indices:
    df.loc[idx, 'Gender'] = np.random.choice(gender_unique_values)

# Handle Country using random sampling  
country_missing_indices = df[df['Country'].isnull()].index
country_unique_values = df['Country'].dropna().unique()

# Fill missing Country values with random sampling
for idx in country_missing_indices:
    df.loc[idx, 'Country'] = np.random.choice(country_unique_values)

print("\nAfter Random Sampling:")
print(df)
print("\nMissing values after imputation:")
print(df.isnull().sum())

Original data:
    Name  Gender  Country
0   John    Male      USA
1  Alice  Female   Canada
2    Bob    Male     None
3   Jane    None       UK
4   Mike    Male  Germany
5  Sarah    None     None

After Random Sampling:
    Name  Gender  Country
0   John    Male      USA
1  Alice  Female   Canada
2    Bob    Male  Germany
3   Jane  Female       UK
4   Mike    Male  Germany
5  Sarah    Male      USA

Missing values after imputation:
Name       0
Gender     0
Country    0
dtype: int64
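
The loop above draws from the unique values, so every category is equally likely no matter how often it occurs. A variant that draws from all observed values, so that common categories are picked more often, might look like the sketch below. The helper name sample_fill and the seed value 42 are illustrative assumptions, not part of the example above.

```python
import numpy as np
import pandas as pd

def sample_fill(series, seed=None):
    """Hypothetical helper: fill gaps by sampling the observed values."""
    rng = np.random.default_rng(seed)      # seeded generator for repeatable runs
    missing = series.isnull()
    observed = series.dropna().to_numpy()  # duplicates kept: frequent values are drawn more often
    filled = series.copy()
    if missing.any() and len(observed) > 0:
        filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

gender = pd.Series(['Male', 'Female', 'Male', None, 'Male', None])
result = sample_fill(gender, seed=42)
print(result.tolist())
```

Passing the same seed gives the same fill on every run, which helps when results need to be reproduced.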

Comparison

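Method	Approach	Best For	Drawback
Mode Imputation	Most frequent value	Few missing values and a clearly dominant category	Inflates the share of the most frequent category
Random Sampling	Random draw from existing unique values	Adding variability to the filled values	Rare categories are drawn as often as common ones; results vary between runs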

Key Considerations

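Data Distribution: Both methods alter category proportions. Mode imputation increases the share of the most frequent category, while drawing uniformly from the unique values shifts the shares toward rarer categories
Bias: Random sampling avoids assigning every missing value to one category, which reduces the over-representation that mode imputation causes
Sample Size: The mode is a more reliable estimate in larger datasets; in small ones it can be decided by a handful of rows or by a tie
Analysis Goals: Choose based on whether you prioritize a simple, deterministic fill or more variability in the imputed values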

Conclusion

Both mode imputation and random sampling are practical methods for handling missing categorical values in Python. Mode imputation is simple and deterministic and works well when few values are missing and one category clearly dominates, while random sampling provides more variability and avoids over-representing a single category. Choose the method that best aligns with your data analysis objectives and dataset characteristics.

Way2Class

Updated on: 2026-03-27T10:14:37+05:30
