Python Pandas - Removing Unused Categories



Removing unused categories from categorical data helps clean and optimize datasets. In pandas, categorical data holds values drawn from a fixed, limited set and is represented by the Categorical type. A Series with categorical data exposes specialized methods through the Series.cat accessor. One of these is remove_unused_categories(), which drops the categories that do not appear in the data.

In this tutorial, we will learn how to remove unused categories from Pandas categorical data, with several examples.

The remove_unused_categories() Method

The Pandas Series.cat.remove_unused_categories() method removes categories that are not used in the data from a Pandas categorical object while maintaining its original data and order.

Syntax

Following is the syntax of this method −

Series.cat.remove_unused_categories(*args, **kwargs)

This method does not require any parameters. It returns a new object that keeps only the categories actually present in the data; the values themselves are left unchanged.
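To make the behaviour described above concrete, here is a minimal sketch. The sample values ("low", "medium", "high") and the variable names are made up for illustration. It checks that the call returns a new Series, leaves the original untouched, and keeps the ordering flag of an ordered categorical.

```python
import pandas as pd

# An ordered categorical with one category ("high") that never appears in the data
s = pd.Series(pd.Categorical(["low", "medium", "low"],
                             categories=["low", "medium", "high"],
                             ordered=True))

# The method is called without arguments and returns a new Series
trimmed = s.cat.remove_unused_categories()

print(list(s.cat.categories))          # the original still lists all three categories
print(list(trimmed.cat.categories))    # the unused "high" category is gone
print(trimmed.cat.ordered)             # the ordering flag is carried over
print(trimmed.tolist() == s.tolist())  # the values are the same as before
```

Because the result is a new object, it has to be assigned back (as the examples in this tutorial do) for the change to persist.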

Removing Unused Categories from a Series

You can remove the unused categories from a Pandas categorical series object directly by using the remove_unused_categories() method.

Example

This example demonstrates how to remove unused categories from a categorical Series using the Pandas Series.cat.remove_unused_categories() method.

import pandas as pd

# Creating a categorical Series
s = pd.Series(["cat", "dog", "cat"], dtype="category")
s = s.cat.add_categories(["mouse", "elephant"])

print("Original Series:")
print(s)

# Removing unused categories
s = s.cat.remove_unused_categories()

print("\nSeries after removing unused categories:")
print(s)

When we run the above program, it produces the following result −

Original Series:
0    cat
1    dog
2    cat
dtype: category
Categories (4, object): ['cat', 'dog', 'mouse', 'elephant']

Series after removing unused categories:
0    cat
1    dog
2    cat
dtype: category
Categories (2, object): ['cat', 'dog']
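In practice, unused categories often appear after filtering rather than after an explicit add_categories() call. The following sketch, with illustrative data and variable names, shows that a category can stay in the category list after all of its rows have been filtered out, and that remove_unused_categories() then trims it.

```python
import pandas as pd

s = pd.Series(["cat", "dog", "mouse", "cat"], dtype="category")

# Filtering drops the "mouse" row, but "mouse" stays in the category list
filtered = s[s != "mouse"]
print(list(filtered.cat.categories))

# Trimming brings the category list back in line with the remaining data
cleaned = filtered.cat.remove_unused_categories()
print(list(cleaned.cat.categories))
```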

Removing Unused Categories from a DataFrame Column

You can also remove unused categories from a DataFrame column using the cat.remove_unused_categories() method.

Example

This example demonstrates how to remove unused categories from a specific column in a DataFrame.

import pandas as pd

# Creating a DataFrame with a categorical column
df = pd.DataFrame({
    "Animal": ["Cat", "Dog", "Mouse"],
    "Category": pd.Series(["A", "B", "A"], dtype="category")
})

# Adding extra categories that are not used in the data
df["Category"] = df["Category"].cat.add_categories(["C", "D"])

print("Original DataFrame:")
print(df['Category'].cat.categories)

# Removing unused categories from the 'Category' column
df["Category"] = df["Category"].cat.remove_unused_categories()

print("\nDataFrame after removing unused categories:")
print(df)

# Checking the updated categories
print("\nUpdated categories in 'Category' column:")
print(df["Category"].cat.categories)

While executing the above code we get the following output −

Original DataFrame:
Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame after removing unused categories:
  Animal Category
0    Cat        A
1    Dog        B
2  Mouse        A

Updated categories in 'Category' column:
Index(['A', 'B'], dtype='object')
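When a DataFrame has several categorical columns, the same call can be applied to each one in a loop. The sketch below is one possible approach, not the only one. The column names and data are invented for illustration, and it uses select_dtypes() to find the categorical columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": pd.Categorical(["Cat", "Dog"], categories=["Cat", "Dog", "Mouse"]),
    "Group": pd.Categorical(["A", "A"], categories=["A", "B"]),
    "Weight": [4.0, 20.0],
})

# Pick out every column that has a categorical dtype
cat_cols = df.select_dtypes(include="category").columns

# Trim the unused categories of each categorical column in turn
for col in cat_cols:
    df[col] = df[col].cat.remove_unused_categories()

print(list(df["Animal"].cat.categories))
print(list(df["Group"].cat.categories))
```

The non-categorical "Weight" column is not selected, so it is left untouched.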

Removing Unused Categories with groupby

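Unused categories also affect groupby operations. When you group on a categorical column with observed=False, pandas creates a group for every declared category, including those that have no rows. Removing the unused categories first restricts the aggregation to the categories that actually occur in the data.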

Example

This example demonstrates how to remove unused categories from a specific column in a DataFrame and then apply a grouping operation.

import pandas as pd

# Creating a DataFrame with a categorical column
df = pd.DataFrame({
    "Value": [10, 15, 10, 20],
    "Category": pd.Categorical(["A", "B", "A", "C"], categories=["A", "B", "C", "D"])
})

# Display the input DataFrame
print("Original DataFrame:")
print(df)

# Removing unused categories
df['Category'] = df['Category'].cat.remove_unused_categories()

# Grouping by 'Category'; observed=False keeps every declared category as a group
grouped = df.groupby('Category', observed=False).mean()

# Display the grouped DataFrame
print("\nGrouped DataFrame after removing unused categories:")
print(grouped)

When we run the above program, it produces the following result −

Original DataFrame:
   Value Category
0     10        A
1     15        B
2     10        A
3     20        C

Grouped DataFrame after removing unused categories:
          Value
Category
A          10.0
B          15.0
C          20.0
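To see what removing the categories changes, it helps to compare the grouping before and after the call. The sketch below reuses the data from the example above and passes observed=False explicitly, so that every declared category is reported as a group. The variable names before and after are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [10, 15, 10, 20],
    "Category": pd.Categorical(["A", "B", "A", "C"], categories=["A", "B", "C", "D"])
})

# With observed=False, the unused category "D" still gets an (empty) group
before = df.groupby("Category", observed=False)["Value"].mean()
print(before)

# After trimming, "D" is no longer a category, so it cannot appear as a group
df["Category"] = df["Category"].cat.remove_unused_categories()
after = df.groupby("Category", observed=False)["Value"].mean()
print(after)
```

Passing observed=True to groupby() is another way to hide empty groups without modifying the column. Which option fits better depends on whether the unused categories are still needed elsewhere.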