How to Handle Missing Values of Categorical Variables in Python?
Missing values are a common occurrence in real-world datasets, and handling them appropriately is crucial for accurate data analysis and modeling. When dealing with categorical variables in Python, there are several approaches to address missing values. In this article, we will explore two practical methods for handling missing values of categorical variables, providing a step-by-step algorithm for each approach.
Syntax
Let's familiarize ourselves with the syntax of the methods we will be using:
# Fill missing values; assign the result back to the column
dataframe['column_name'] = dataframe['column_name'].fillna(value)
# Calculate the mode (most frequent value) of a column
mode_value = dataframe['column_name'].mode()[0]
Algorithm
Step 1: Import necessary libraries (pandas, numpy)
Step 2: Load or create the dataset
Step 3: Identify missing values using isnull()
Step 4: Choose handling method (mode or random sampling)
Step 5: Apply the chosen method to fill missing values
Step 6: Verify missing values are handled
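The steps above can be sketched end to end for the mode-based variant. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper name `fill_missing_with_mode`, the `Color` column, and the tiny DataFrame are all hypothetical and chosen only for this example.

```python
import pandas as pd

def fill_missing_with_mode(df, columns):
    """Return a copy of df in which the given columns have missing values filled by their mode."""
    result = df.copy()
    for col in columns:
        # Step 3: skip columns that have nothing missing
        if not result[col].isnull().any():
            continue
        modes = result[col].mode()
        # A column with no observed values has no mode; leave it untouched
        if modes.empty:
            continue
        # Step 5: fill and assign back to the column
        result[col] = result[col].fillna(modes[0])
    return result

# Step 2: a small hypothetical dataset
df = pd.DataFrame({"Color": ["Red", "Blue", None, "Red"]})
filled = fill_missing_with_mode(df, ["Color"])

# Step 6: verify that no missing values remain
print(filled.isnull().sum())
```

Working on a copy keeps the original DataFrame available, which makes it easy to compare the data before and after imputation.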
Sample Dataset
Let's create a sample dataset with missing categorical values:
import pandas as pd
import numpy as np
# Create sample data with missing values
data = {
    'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
    'Age': [25, 30, 35, 27, 22],
    'Gender': ['Male', 'Female', 'Male', None, 'Male'],
    'Country': ['USA', 'Canada', None, 'UK', 'Germany']
}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
print("\nMissing values:")
print(df.isnull().sum())
Original Dataset:
    Name  Age  Gender  Country
0   John   25    Male      USA
1  Alice   30  Female   Canada
2    Bob   35    Male     None
3   Jane   27    None       UK
4   Mike   22    Male  Germany

Missing values:
Name       0
Age        0
Gender     1
Country    1
dtype: int64
Method 1: Mode Imputation
Mode imputation fills missing values with the most frequent category in the column:
import pandas as pd
import numpy as np
# Create sample data
data = {
    'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
    'Gender': ['Male', 'Female', 'Male', None, 'Male'],
    'Country': ['USA', 'Canada', None, 'UK', 'Germany']
}
df = pd.DataFrame(data)
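# Handle Gender column using mode imputation
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
gender_mode = df['Gender'].mode()[0]
df['Gender'] = df['Gender'].fillna(gender_mode)
# Handle Country column using mode imputation
# (every country appears once, so mode() returns all of them sorted and [0] picks 'Canada')
country_mode = df['Country'].mode()[0]
df['Country'] = df['Country'].fillna(country_mode)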
print("After Mode Imputation:")
print(df)
print("\nMissing values after imputation:")
print(df.isnull().sum())
After Mode Imputation:
    Name  Gender  Country
0   John    Male      USA
1  Alice  Female   Canada
2    Bob    Male   Canada
3   Jane    Male       UK
4   Mike    Male  Germany

Missing values after imputation:
Name       0
Gender     0
Country    0
dtype: int64
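Bob's country is filled with Canada even though no country appears more than once. The short sketch below, which reuses the sample Country values as a standalone Series, shows why: when several categories tie, `mode()` returns all of them in sorted order, so `mode()[0]` picks the alphabetically first one.

```python
import pandas as pd

# Same Country values as the sample dataset; every category appears exactly once
country = pd.Series(["USA", "Canada", None, "UK", "Germany"])

# With a tie, mode() returns every tied category, sorted
modes = country.mode()
print(modes.tolist())

# Index 0 is therefore the alphabetically first category, not a "most common" one
print(modes[0])
```

When a column has such ties, the filled value is essentially arbitrary, so it is worth checking `len(modes)` before relying on `mode()[0]`.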
How Mode Imputation Works
Mode imputation works by:
Calculating the most frequent category using mode()[0]
Replacing all missing values with this most frequent value
Leaving the other categories unchanged while raising the share of the most frequent one, an effect that stays small when only a few values are missing
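A quick way to see what mode imputation does to the category shares is to compare `value_counts(normalize=True)` before and after filling. The sketch below uses the sample Gender values as a standalone Series; the variable names are illustrative.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Male", None, "Male"])

# Shares among the observed values (value_counts ignores missing values by default)
before = gender.value_counts(normalize=True)

# Fill the single missing value with the mode
mode_value = gender.mode()[0]
filled = gender.fillna(mode_value)

# Shares after imputation: every filled value lands in the mode category
after = filled.value_counts(normalize=True)

print(before)
print(after)
```

Here the share of Male goes from 3 out of 4 observed values to 4 out of 5 after filling, so the most frequent category grows slightly at the expense of the others.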
Method 2: Random Sampling
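Random sampling replaces each missing value with a category chosen at random from the existing unique values. Because no random seed is set, the filled values differ from run to run, so the output shown is only one possible result: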
import pandas as pd
import numpy as np
# Create sample data
data = {
    'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike', 'Sarah'],
    'Gender': ['Male', 'Female', 'Male', None, 'Male', None],
    'Country': ['USA', 'Canada', None, 'UK', 'Germany', None]
}
df = pd.DataFrame(data)
print("Original data:")
print(df)
# Handle Gender using random sampling
gender_missing_indices = df[df['Gender'].isnull()].index
gender_unique_values = df['Gender'].dropna().unique()
# Fill each missing Gender value with a randomly chosen existing category
for idx in gender_missing_indices:
    df.loc[idx, 'Gender'] = np.random.choice(gender_unique_values)
# Handle Country using random sampling
country_missing_indices = df[df['Country'].isnull()].index
country_unique_values = df['Country'].dropna().unique()
# Fill each missing Country value with a randomly chosen existing category
for idx in country_missing_indices:
    df.loc[idx, 'Country'] = np.random.choice(country_unique_values)
print("\nAfter Random Sampling:")
print(df)
print("\nMissing values after imputation:")
print(df.isnull().sum())
Original data:
    Name  Gender  Country
0   John    Male      USA
1  Alice  Female   Canada
2    Bob    Male     None
3   Jane    None       UK
4   Mike    Male  Germany
5  Sarah    None     None

After Random Sampling:
    Name  Gender  Country
0   John    Male      USA
1  Alice  Female   Canada
2    Bob    Male  Germany
3   Jane  Female       UK
4   Mike    Male  Germany
5  Sarah    Male      USA

Missing values after imputation:
Name       0
Gender     0
Country    0
dtype: int64
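The loop above draws uniformly from the unique categories, so a rare category is as likely to be chosen as a common one. A possible variant, sketched here on the assumption that filled values should follow the observed frequencies, draws from all non-missing values instead and uses a seeded generator for reproducibility. The helper name `fill_by_observed_sampling` and the seed value are hypothetical and not part of the method shown above.

```python
import numpy as np
import pandas as pd

def fill_by_observed_sampling(series, seed=0):
    """Fill missing values by sampling from the observed (non-missing) values."""
    rng = np.random.default_rng(seed)       # seeded generator, so results repeat
    observed = series.dropna().to_numpy()   # repeated values keep their frequency
    missing = series.isnull()
    filled = series.copy()
    # Nothing to draw from, or nothing to fill
    if len(observed) == 0 or not missing.any():
        return filled
    # One draw per missing entry; common categories are drawn more often
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

gender = pd.Series(["Male", "Female", "Male", None, "Male", None])
filled = fill_by_observed_sampling(gender, seed=42)
print(filled.tolist())
```

Calling the helper again with the same seed gives the same result, which makes the imputation step repeatable across runs.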
Comparison
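| Method | Approach | Best For | Drawback |
|---|---|---|---|
| Mode Imputation | Most frequent value | Columns with a clearly dominant category and few missing values | Over-represents the most frequent category |
| Random Sampling | Random draw from existing unique values | Avoiding a pile-up on a single category | Ignores category frequencies; results vary between runs unless seeded |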
Key Considerations
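Data Distribution: Mode imputation shifts the distribution toward the most frequent category, while uniform random sampling spreads filled values evenly across categories regardless of how common each one is
Bias: Random sampling avoids over-representing the mode, but it can over-represent rare categories
Sample Size: Mode imputation is more reliable when the column has enough observed values for the most frequent category to be clearly established
Analysis Goals: Choose based on whether you prioritize a deterministic, repeatable fill or added variability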
Conclusion
Both mode imputation and random sampling are simple ways to handle missing categorical values in Python. Mode imputation is deterministic and works well when one category clearly dominates and few values are missing, while random sampling adds variability and avoids inflating a single category. Choose the method that best aligns with your data analysis objectives and dataset characteristics.
