How to Handle Missing Values of Categorical Variables in Python?


Missing values are common in real-world datasets, and handling them appropriately is critical for accurate data analysis and modeling. When working with categorical variables in Python, there are several ways to address missing values. In this article, we will explore two practical techniques for handling missing values of categorical variables, giving a step-by-step algorithm for each approach. We will also present runnable Python code examples that show how to implement these techniques.

Syntax

Let's first familiarize ourselves with the syntax of the method we will be using −

# Syntax for filling missing values with a chosen fill_value
dataframe['column_name'] = dataframe['column_name'].fillna(fill_value)

Algorithm

  • Step 1 − Import necessary libraries

  • Step 2 − Load the data

  • Step 3 − Identify missing values

  • Step 4 − Handling missing values

  • Step 5 − Verify missing values

  • Step 6 − Perform further analysis

Dataset Taken

Name,Age,Gender,Country
John,25,Male,USA
Alice,30,Female,Canada
Bob,35,Male,
Jane,27,
Mike,22,Male,Germany
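
As a minimal sketch of Steps 1 to 3, the sample above can be loaded from an in-memory string and checked for missing values. The csv_text variable and the use of io.StringIO are assumptions made for illustration; in practice you would read your own file. Note that the row for Jane has fewer fields than the header, so pandas leaves both Gender and Country missing for that row.

```python
import io
import pandas as pd

# The sample dataset from above, held in a string for illustration
csv_text = """Name,Age,Gender,Country
John,25,Male,USA
Alice,30,Female,Canada
Bob,35,Male,
Jane,27,
Mike,22,Male,Germany
"""

data = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column
missing_counts = data.isnull().sum()
print(missing_counts)
```

The per-column counts show which categorical columns need imputation before moving on to Step 4.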

Approach 1: Mode Imputation

Identify the categorical column(s) containing missing values in your dataset.

Compute the mode (the most frequent value) of the respective column(s) using the mode() function.

Fill the missing values with the computed mode by passing it as the fill value to the fillna() method.

Example

import pandas as pd

# Load the dataset
data = pd.read_csv('your_dataset.csv')

# Identify the column(s) with missing values
column_with_missing_values = 'Country'

# Compute the mode
mode_value = data[column_with_missing_values].mode()[0]

# Fill the missing values with the mode and assign the result back to the column
data[column_with_missing_values] = data[column_with_missing_values].fillna(mode_value)

# Verify the changes
print(data[column_with_missing_values].isnull().sum())

Output

0

Explanation

Mode imputation is a common method for handling missing values in categorical variables. It involves filling the missing values with the mode, which represents the most frequent category in the column. Here is a step-by-step explanation of how it works −

Identify the categorical column(s) containing missing values in your dataset − First, you need to identify the column(s) where missing values are present. These columns will be the focus of the mode imputation process.

Compute the mode of the respective column(s) using the mode() function − Once you have identified the column(s) with missing values, you can compute the mode of each column using the mode() function. The mode is the category that occurs most often in the column.

Fill the missing values with the computed mode using the fillna() method − After determining the mode, you can proceed to fill the missing values in the categorical column(s) with it. This can be done with the fillna() method in pandas, passing the computed mode as the fill value. Note that fillna() returns a new Series by default, so the result must be stored back in the column for the change to take effect.

Mode imputation is a straightforward and intuitive approach to handling missing values in categorical variables. Because it fills the missing values with the most frequent category, the ranking of categories in the column stays largely the same, although the share of the mode increases. However, it is important to note that this approach may introduce bias if the values are not missing at random. Also, when several columns have missing values, each column should be processed separately.
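
The effect on the category counts can be seen in a small sketch. The DataFrame below re-creates the Gender and Country columns of the sample dataset by hand (an assumption made so the block is self-contained), and each column is imputed separately. When several categories are tied, mode() returns all of them in sorted order, so taking [0] picks the first one alphabetically.

```python
import pandas as pd

# Hand-built copy of the sample's categorical columns (for illustration only)
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', None, 'Male'],
    'Country': ['USA', 'Canada', None, None, 'Germany'],
})

for column in ['Gender', 'Country']:
    # mode() returns a Series; tied values are sorted, so [0] is the first of them
    mode_value = data[column].mode()[0]
    data[column] = data[column].fillna(mode_value)
    print(column, 'filled with', mode_value)
    print(data[column].value_counts())
```

In the Country column every observed category appears once, so the choice of fill value is decided by sort order alone, which is worth keeping in mind for small or evenly spread columns.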

Approach 2: Random Sampling

  • Identify the categorical column(s) with missing values.

  • Find the row indices of the missing values.

  • Create a list of unique non-missing categories in the column(s) using the dropna() and unique() functions.

  • Replace the missing values with categories sampled at random from the list using numpy's random.choice() function.

Example

import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('your_dataset.csv')

# Identify the column with missing values
column_with_missing_values = 'Gender'

# Find the row indices of the missing values
missing_indices = data[data[column_with_missing_values].isnull()].index

# Get unique non-missing categories in the column (unique() alone would include NaN)
unique_categories = data[column_with_missing_values].dropna().unique()

# Replace missing values with random sampling
data.loc[missing_indices, column_with_missing_values] = np.random.choice(unique_categories, len(missing_indices))

# Verify the changes
print(data[column_with_missing_values].isnull().sum())

Output

0

Explanation

Random sampling is an alternative approach to handling missing values in categorical variables. Instead of imputing missing values with the mode, this approach replaces the missing values with randomly sampled categories from the existing unique categories within the column. Here is a detailed explanation of how it works −

Identify the categorical column(s) with missing values − Begin by identifying the column(s) in your dataset that contain missing values. These are the columns that will be the focus of the random sampling process.

Find the indices of the missing values − Next, collect the row indices at which the categorical column is missing. The example does this with isnull() and the index attribute of the filtered DataFrame.

Create a list of unique categories in the column − Extract the unique categories present in the categorical column(s) with missing values. This list will be used for random sampling, so the missing values themselves should be excluded from it; otherwise they could be drawn as replacements.

Replace missing values with random sampling − With the indices of the missing values and the list of unique categories, replace the missing values in the categorical column(s) by randomly sampling categories from the list. This is done by drawing categories with numpy's random.choice() function and assigning them to the rows at the missing indices with loc.

Random sampling provides a flexible approach to handling missing values in categorical variables. By randomly assigning categories, it allows for variability in the imputed values and avoids inflating a single category, as mode imputation does. However, because every unique category is drawn with equal probability, random sampling might change the distribution of categories in the column, potentially affecting subsequent analysis or modeling tasks. Additionally, as with mode imputation, each column with missing values should be processed independently.
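
One possible variation, not part of the approach described above, is to draw replacements from the observed non-missing values rather than from the list of unique categories, so that frequent categories are chosen more often. The sketch below assumes a hand-built Gender column and a fixed seed (both chosen for illustration) and uses numpy's default_rng generator so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Hand-built example column (for illustration only)
data = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', None, 'Male', None]})

rng = np.random.default_rng(seed=0)  # fixed seed, an arbitrary choice

column = 'Gender'
missing_mask = data[column].isnull()

# Observed values keep their duplicates, so sampling follows their frequencies
observed_values = data.loc[~missing_mask, column].to_numpy()
data.loc[missing_mask, column] = rng.choice(observed_values, size=missing_mask.sum())

# Verify that no missing values remain
print(data[column].isnull().sum())
```

Whether this weighting is appropriate depends on the dataset; it simply keeps the imputed values closer to the observed category frequencies than uniform sampling would.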

Both mode imputation and random sampling offer viable approaches for handling missing values in categorical variables. The choice between the two depends on the specific characteristics of the dataset and the goals of the analysis. It is essential to assess the possible effect of each approach on the integrity and reliability of the data before making a decision.

Conclusion

Handling missing values is a fundamental step in data preprocessing, and when working with categorical variables in Python, two effective approaches can be used − mode imputation and random sampling. The mode imputation method fills missing values with the most frequent category, while the random sampling approach replaces missing values with randomly selected categories from the existing unique categories. Using these methods, data analysts and data scientists can obtain complete categorical columns for further analysis. Remember to adapt these techniques to suit your specific dataset and always evaluate the impact of the chosen approach on your analysis.

Updated on: 27-Jul-2023
