Python Pandas - Grouping Categorical Data



Grouping categorical data in Pandas is a useful technique for summarizing and analyzing datasets. Categorical variables in Pandas are represented by the Categorical type, which efficiently handles variables that take a limited set of possible values (categories or labels), such as days of the week, gender ("male" or "female"), or product ratings on a scale such as "poor," "average," and "excellent."
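
As a small illustration of the Categorical type described above, the sketch below builds a categorical array from a few rating labels and inspects its categories and the integer codes stored for each value. The sample ratings are made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data: product ratings restricted to three labels
ratings = pd.Categorical(
    ["poor", "excellent", "average", "poor"],
    categories=["poor", "average", "excellent"],
    ordered=True)

# The full set of allowed labels, in their declared order
print(list(ratings.categories))

# Each value is stored as an integer code pointing into the categories
print(ratings.codes.tolist())

# Because the categories are ordered, min() and max() are meaningful
print(ratings.min(), ratings.max())
```

Declaring ordered=True is what makes comparisons and min/max follow the category order rather than alphabetical order.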

In this tutorial, we will learn how to group categorical data using pandas, focusing on the effect of ordered categories and the observed parameter.

Grouping Categorical Data

The pandas groupby() method can be used to group data by a categorical column, allowing for efficient data aggregation.

Example

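The following example uses the groupby() method to summarize data by a categorical column. The observed parameter controls whether unused categories appear in the result, and the sort parameter controls whether the groups are returned in sorted order (for a categorical key, the category order).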

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping by 'day' with all categories including unused
result = df.groupby("day", observed=False, sort=True).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)

When we run the above program, it produces the following result:

Input Categorical DataFrame:
   day  workers
0  Wed        3
1  Mon        4
2  Thu        1
3  Mon        4
4  Wed        2
5  Sat        2

Grouped categorical Data:
     workers
day
Mon        8
Tue        0
Wed        5
Thu        1
Fri        0
Sat        2
Sun        0

When you group by a categorical variable with observed=False, Pandas includes every category, even those that do not appear in the data, such as Tue, Fri, and Sun in the above output. Because the sum of an empty group is 0, these categories are reported with 0 workers.
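
How an unused category shows up depends on the aggregation. The sketch below reuses the same data and compares sum(), mean(), and size() side by side. The column names total, average, and rows are chosen only for this illustration.

```python
import pandas as pd

days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

grouped = df.groupby("day", observed=False)["workers"]

# sum() of an empty group is 0, mean() of an empty group is NaN,
# and size() reports how many rows each category actually has
summary = pd.DataFrame({
    "total": grouped.sum(),
    "average": grouped.mean(),
    "rows": grouped.size()})
print(summary)
```

Checking the rows column is one way to tell a category that truly sums to zero apart from one that has no data at all.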

Grouping Only the Observed Categories

If you want to group only the observed categories, you can set observed=True. This will exclude any unused categories from the result.

Example

This example groups only the observed categories of a categorical column by passing observed=True to the groupby() method.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping by 'day' with only observed categories
result = df.groupby("day", observed=True).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)

Executing the above code produces the following output:

Input Categorical DataFrame:
   day  workers
0  Wed        3
1  Mon        4
2  Thu        1
3  Mon        4
4  Wed        2
5  Sat        2

Grouped categorical Data:
     workers
day
Mon        8
Wed        5
Thu        1
Sat        2

As you can see, categories like Tue, Fri, and Sun are not included in the result because they were not observed in the dataset.
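
An alternative way to get the same result, sketched here as one possible approach rather than the recommended one, is to drop the unused categories from the column itself with the cat.remove_unused_categories() accessor method before grouping. The variable name trimmed is used only for this illustration.

```python
import pandas as pd

days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# Drop categories that never occur; the remaining ones keep their order
trimmed = df.assign(day=df["day"].cat.remove_unused_categories())

# Even with observed=False, only Mon, Wed, Thu, and Sat remain
result = trimmed.groupby("day", observed=False).sum()
print(result)
```

Unlike observed=True, this changes the column's set of categories, so it is better suited to cases where the unused labels are not needed anywhere later in the analysis.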

Grouping by Multiple Categorical Columns

You can also group data by multiple categorical columns with the groupby() method. The observed parameter then controls whether combinations of categories that do not occur in the data are included in the result.

Example

The following example demonstrates grouping the categorical data by multiple columns using the groupby() method.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Creating another Categorical column for 'grade'
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping Multiple Categorical Columns
result = df.groupby(["day", "grades"], observed=False).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data by Multiple Columns:')
print(result)

Following is the output of the above code:

Input Categorical DataFrame:
   day  workers     grades
0  Wed        3       good
1  Mon        4       good
2  Thu        1   very bad
3  Mon        4  very good
4  Wed        2  very good
5  Sat        2       good

Grouped categorical Data by Multiple Columns:
               workers
day grades
Mon very bad         0
    bad              0
    medium           0
    good             4
    very good        4
Tue very bad         0
    bad              0
    medium           0
    good             0
    very good        0
Wed very bad         0
    bad              0
    medium           0
    good             3
    very good        2
Thu very bad         1
    bad              0
    medium           0
    good             0
    very good        0
Fri very bad         0
    bad              0
    medium           0
    good             0
    very good        0
Sat very bad         0
    bad              0
    medium           0
    good             2
    very good        0
Sun very bad         0
    bad              0
    medium           0
    good             0
    very good        0

The result includes every combination of categories from both columns (7 days × 5 grades = 35 rows), even those combinations that never occur in the data.
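
To make the size difference concrete, the following sketch groups the same data twice, once with observed=False and once with observed=True, and compares the number of rows in each result. The variable names all_combos and seen_combos are chosen only for this illustration.

```python
import pandas as pd

days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})

# Every day/grade pair, whether or not it occurs in the data
all_combos = df.groupby(["day", "grades"], observed=False).sum()

# Only the day/grade pairs that actually occur
seen_combos = df.groupby(["day", "grades"], observed=True).sum()

print(len(all_combos))   # number of category combinations
print(len(seen_combos))  # number of combinations present in df
```

With many categorical keys, the full set of combinations grows multiplicatively, which is why observed=True is often preferred for larger datasets.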

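Example: Grouping Multiple Categorical Columns with observed=True

The following example groups data by three categorical columns. Setting observed=True keeps only the combinations that occur in the data, and as_index=False returns the grouping keys as regular columns instead of an index.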

import pandas as pd

# Define grouping columns
group_cols = ['Cat_col1', 'Cat_col2', 'Cat_col3']

# Create DataFrame
df = pd.DataFrame([
    ['A', 'B', 'C', 5.4],
    ['A', 'B', 'D', 11.23],
    ['B', 'A', 'C', 94.12],
    ['B', 'A', 'A', 165.2],
    ['A', 'B', 'D', 565.6]],
    columns=(group_cols + ['Value']))

# Convert columns to categorical types
for col in group_cols:
    df[col] = df[col].astype('category')

# Grouping the data
result = df.groupby(group_cols, as_index=False, observed=True).sum()

# Display the Result
print(result)

When we run the above program, it produces the following result:

  Cat_col1 Cat_col2 Cat_col3   Value
0        A        B        C    5.40
1        A        B        D  576.83
2        B        A        A  165.20
3        B        A        C   94.12
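
For comparison, the sketch below runs the same grouping with the default as_index=True, which places the grouping keys in a MultiIndex, and then moves them back into columns with reset_index(). The variable names flat, indexed, and roundtrip are chosen only for this illustration.

```python
import pandas as pd

group_cols = ['Cat_col1', 'Cat_col2', 'Cat_col3']
df = pd.DataFrame([
    ['A', 'B', 'C', 5.4],
    ['A', 'B', 'D', 11.23],
    ['B', 'A', 'C', 94.12],
    ['B', 'A', 'A', 165.2],
    ['A', 'B', 'D', 565.6]],
    columns=(group_cols + ['Value']))
for col in group_cols:
    df[col] = df[col].astype('category')

# Grouping keys returned as ordinary columns
flat = df.groupby(group_cols, as_index=False, observed=True).sum()

# Grouping keys returned as a MultiIndex (the default)
indexed = df.groupby(group_cols, observed=True).sum()

# Moving the index levels back into columns gives the same layout
roundtrip = indexed.reset_index()

print(list(indexed.index.names))
print(list(flat.columns) == list(roundtrip.columns))
```

The indexed form is convenient for label-based lookups such as indexed.loc[('A', 'B', 'D')], while the flat form is usually easier to export or merge.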