Python Pandas - Grouping Categorical Data



Grouping categorical data in Pandas is a useful technique for summarizing and analyzing datasets. Categorical variables in Pandas are often represented by the Categorical type, which provides an efficient way to handle variables with a limited set of possible values (categories or labels), such as days of the week, gender ("male" or "female"), or product ratings on a scale such as "poor," "average," and "excellent."
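
As a quick illustration of the Categorical type described above, the following minimal sketch builds an ordered categorical from the product-rating example. The variable name ratings and the sample values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data: product ratings on an ordered scale
ratings = pd.Categorical(
    ["average", "poor", "excellent", "average"],
    categories=["poor", "average", "excellent"],
    ordered=True,
)

# The full set of allowed labels, in their declared order
print(ratings.categories.tolist())

# Each value is stored as a small integer code pointing into the categories
print(ratings.codes.tolist())

# Because ordered=True, comparisons and min/max follow the category order
print(ratings.min(), ratings.max())
print((ratings > "poor").tolist())
```

Storing integer codes instead of repeated strings is what makes the type memory-efficient, and the declared order is what groupby() later uses to arrange the groups.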

In this tutorial, we will learn how to group categorical data using Pandas, focusing on the effect of ordered categories and the observed parameter.

Grouping Categorical Data

The pandas groupby() method can be used to group data by a categorical column, allowing for efficient data aggregation.

Example

The following example uses the groupby() method to summarize the data by a categorical column. The observed parameter controls whether unused categories appear in the result, and the sort parameter controls whether the groups are sorted by key.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping by 'day' with all categories including unused
result = df.groupby("day", observed=False, sort=True).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)

When we run the above program, it produces the following result:

Input Categorical DataFrame:
   day  workers
0  Wed        3
1  Mon        4
2  Thu        1
3  Mon        4
4  Wed        2
5  Sat        2

Grouped categorical Data:
     workers
day
Mon        8
Tue        0
Wed        5
Thu        1
Fri        0
Sat        2
Sun        0

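When you group by a categorical variable with observed=False, Pandas includes all categories in the result, even those that do not appear in the data, such as Tue, Fri, and Sun in the above output. Because sort=True, the groups follow the declared category order (Mon through Sun) rather than alphabetical order.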
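
The value shown for an unused category depends on the aggregation. The sketch below reuses the same data and compares sum(), size(), and mean() with observed=False; the summary DataFrame is only a convenience assumed here for side-by-side display.

```python
import pandas as pd

days = pd.Categorical(
    ["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True,
)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# Keep every category, including the unused ones
grouped = df.groupby("day", observed=False)["workers"]

# Collect three aggregations side by side for comparison
summary = pd.DataFrame({
    "sum": grouped.sum(),    # unused categories sum to 0
    "size": grouped.size(),  # unused categories have 0 rows
    "mean": grouped.mean(),  # unused categories have no mean (NaN)
})
print(summary)
```

A 0 in the sum column therefore does not distinguish "no rows" from "rows that add up to zero"; checking size() or mean() alongside it makes the difference visible.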

Grouping Only the Observed Categories

If you want to group only the observed categories, you can set observed=True. This will exclude any unused categories from the result.

Example

This example groups only the observed categories of a categorical column by passing observed=True to the groupby() method.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping by 'day' with only observed categories
result = df.groupby("day", observed=True).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)

Executing the above code produces the following output:

Input Categorical DataFrame:
   day  workers
0  Wed        3
1  Mon        4
2  Thu        1
3  Mon        4
4  Wed        2
5  Sat        2

Grouped categorical Data:
     workers
day
Mon        8
Wed        5
Thu        1
Sat        2

As you can see, categories like Tue, Fri, and Sun are not included in the result because they were not observed in the dataset.
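
To make the difference between the two settings concrete, this short sketch runs both groupings on the same data and lists the categories that observed=True leaves out. The variable names all_cats, seen_only, and dropped are illustrative assumptions.

```python
import pandas as pd

days = pd.Categorical(
    ["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True,
)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# Group once with every category and once with observed categories only
all_cats = df.groupby("day", observed=False)["workers"].sum()
seen_only = df.groupby("day", observed=True)["workers"].sum()

# Compare the number of groups produced by each setting
print(len(all_cats), len(seen_only))

# List the categories that disappear when observed=True
seen = set(seen_only.index)
dropped = [day for day in all_cats.index if day not in seen]
print(dropped)
```

The default value of observed has not been the same across pandas releases (an assumption worth checking against your installed version), so passing it explicitly keeps the result predictable.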

Grouping by Multiple Categorical Columns

You can also group data by multiple categorical columns with the groupby() method, and you can use the observed parameter to control how missing categories are handled.

Example

The following example demonstrates grouping the categorical data by multiple columns using the groupby() method.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Creating another Categorical column for 'grade'
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping Multiple Categorical Columns
result = df.groupby(["day", "grades"], observed=False).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data by Multiple Columns:')
print(result)

Following is the output of the above code:

Input Categorical DataFrame:
   day  workers     grades
0  Wed        3       good
1  Mon        4       good
2  Thu        1   very bad
3  Mon        4  very good
4  Wed        2  very good
5  Sat        2       good

Grouped categorical Data by Multiple Columns:
               workers
day grades
Mon very bad         0
    bad              0
    medium           0
    good             4
    very good        4
Tue very bad         0
    bad              0
    medium           0
    good             0
    very good        0
Wed very bad         0
    bad              0
    medium           0
    good             3
    very good        2
Thu very bad         1
    bad              0
    medium           0
    good             0
    very good        0
Fri very bad         0
    bad              0
    medium           0
    good             0
    very good        0
Sat very bad         0
    bad              0
    medium           0
    good             2
    very good        0
Sun very bad         0
    bad              0
    medium           0
    good             0
    very good        0

The result includes every combination of categories from both columns (7 days x 5 grades = 35 rows), even combinations that never occur in the data. For sum(), those rows show 0.
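
The size of that full product can be checked directly. The following sketch, using the same data, compares the number of rows returned with observed=False and observed=True; the names full and seen are assumptions for illustration.

```python
import pandas as pd

days = pd.Categorical(
    ["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True,
)
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"],
)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2], "grades": grades})

# Every (day, grade) pair from the two category lists
full = df.groupby(["day", "grades"], observed=False)["workers"].sum()

# Only the (day, grade) pairs that actually occur in the data
seen = df.groupby(["day", "grades"], observed=True)["workers"].sum()

print(len(full))  # 7 days x 5 grades
print(len(seen))  # distinct pairs present in df
print(seen)
```

With many categorical columns the full product grows multiplicatively, so observed=True is usually the more practical choice unless the empty combinations are needed for reporting.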

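Example: Grouping Multiple Categorical Columns with observed=True

The following example groups data by three categorical columns and sets observed=True, so only the combinations that actually occur in the data appear in the result. This keeps the output compact and the grouping efficient.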

import pandas as pd

# Define grouping columns
group_cols = ['Cat_col1', 'Cat_col2', 'Cat_col3']

# Create DataFrame
df = pd.DataFrame([
    ['A', 'B', 'C', 5.4],
    ['A', 'B', 'D', 11.23],
    ['B', 'A', 'C', 94.12],
    ['B', 'A', 'A', 165.2],
    ['A', 'B', 'D', 565.6]],
    columns=(group_cols + ['Value']))

# Convert columns to categorical types
for col in group_cols:
    df[col] = df[col].astype('category')

# Grouping the data
result = df.groupby(group_cols, as_index=False, observed=True).sum()

# Display the Result
print(result)

When we run the above program, it produces the following result:

  Cat_col1 Cat_col2 Cat_col3   Value
0        A        B        C    5.40
1        A        B        D  576.83
2        B        A        A  165.20
3        B        A        C   94.12