Python Pandas - Grouping Categorical Data

Grouping categorical data in Pandas is a useful technique for summarizing and analyzing datasets. Categorical variables in Pandas are usually represented by the Categorical type, which efficiently handles variables that take a limited set of possible values (categories or labels), such as days of the week, gender ("male" or "female"), or product ratings on a scale such as "poor," "average," and "excellent."
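As a minimal sketch of what the Categorical type stores, the snippet below builds an ordered categorical from the rating labels mentioned above. The variable name ratings and the sample values are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# Hypothetical sample data: ratings on an ordered scale
ratings = pd.Categorical(
    ["poor", "excellent", "average", "poor"],
    categories=["poor", "average", "excellent"],
    ordered=True,
)

# The declared categories, in their declared order
print(ratings.categories.tolist())

# Each value is stored as an integer code pointing into the categories
print(ratings.codes.tolist())

# Because the categorical is ordered, min() and max() are defined
print(ratings.min(), ratings.max())
```

Storing integer codes instead of repeated strings is what makes the type compact, and the declared category order is what later determines the order of groups.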

In this tutorial, we will learn how to group categorical data using pandas, focusing on the effect of ordered categories and the observed parameter.

Grouping Categorical Data

The pandas groupby() method can be used to group data by a categorical column, allowing for efficient data aggregation.

Example

The following example groups a DataFrame by an ordered categorical column and sums each group. The observed and sort parameters control which categories appear in the result and in what order.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping by 'day' with all categories including unused
result = df.groupby("day", observed=False, sort=True).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)

When we run the above program, it produces the following result:

Input Categorical DataFrame:

	day	workers
0	Wed	3
1	Mon	4
2	Thu	1
3	Mon	4
4	Wed	2
5	Sat	2

Grouped categorical Data:

	workers
day
Mon	8
Tue	0
Wed	5
Thu	1
Fri	0
Sat	2
Sun	0

When you group by a categorical variable with observed=False, Pandas includes every declared category in the result, even those that are not present in the data, such as Tue, Fri, and Sun in the above output. Their sums are 0 because there are no rows to add up.
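The value reported for an unused category depends on the aggregation. The sketch below reuses the same data and compares sum(), size(), and mean() with observed=False; the column names total, rows, and mean in the summary table are assumptions chosen for illustration.

```python
import pandas as pd

# Same data as in the example above
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# Keep unused categories in the result
grouped = df.groupby("day", observed=False)["workers"]

# Combine three aggregations side by side (column names are illustrative)
summary = pd.DataFrame({
    "total": grouped.sum(),   # 0 for unused categories
    "rows": grouped.size(),   # 0 for unused categories
    "mean": grouped.mean(),   # NaN for unused categories
})
print(summary)
```

An unused category therefore shows up as 0 for additive aggregations but as NaN for mean(), since there are no values to average.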

Grouping Only the Observed Categories

If you want to group only the observed categories, you can set observed=True. This will exclude any unused categories from the result.

Example

This example groups only the observed categories of a categorical column by passing observed=True to the groupby() method.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping by 'day' with only observed categories
result = df.groupby("day", observed=True).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)

Executing the above code gives the following output:

Input Categorical DataFrame:

	day	workers
0	Wed	3
1	Mon	4
2	Thu	1
3	Mon	4
4	Wed	2
5	Sat	2

Grouped categorical Data:

	workers
day
Mon	8
Wed	5
Thu	1
Sat	2

As you can see, categories like Tue, Fri, and Sun are not included in the result because they were not observed in the dataset.
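An alternative way to get a similar result is to drop the unused categories from the column itself before grouping. The sketch below compares the two approaches on the same data; the variable names trimmed, by_observed, and by_trimmed are illustrative. Because the default value of observed has changed between pandas versions, passing it explicitly, as done here, is the safer habit.

```python
import pandas as pd

# Same data as in the example above
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# Approach 1: keep all categories on the column, filter at grouping time
by_observed = df.groupby("day", observed=True)["workers"].sum()

# Approach 2: remove unused categories from the column, then group
trimmed = df.assign(day=df["day"].cat.remove_unused_categories())
by_trimmed = trimmed.groupby("day", observed=False)["workers"].sum()

print(by_observed)
print(by_trimmed)
```

Both results contain the same four groups and totals. The difference is that the first approach leaves the full category list on the column for later use, while the second discards it.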

Grouping by Multiple Categorical Columns

You can also group data by multiple categorical columns with the groupby() method. Here the observed parameter controls whether combinations of categories that never occur together in the data are included in the result.

Example

The following example demonstrates grouping the categorical data by multiple columns using the groupby() method.

import pandas as pd

# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)

# Creating another Categorical column for 'grade'
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])

# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})

# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)

# Grouping Multiple Categorical Columns
result = df.groupby(["day", "grades"], observed=False).sum()

# Display the Grouped categorical Data
print('\nGrouped categorical Data by Multiple Columns:')
print(result)

The following is the output of the above code:

Input Categorical DataFrame:

	day	workers	grades
0	Wed	3	good
1	Mon	4	good
2	Thu	1	very bad
3	Mon	4	very good
4	Wed	2	very good
5	Sat	2	good

Grouped categorical Data by Multiple Columns:

		workers
day	grades
Mon	very bad	0
	bad	0
	medium	0
	good	4
	very good	4
Tue	very bad	0
	bad	0
	medium	0
	good	0
	very good	0
Wed	very bad	0
	bad	0
	medium	0
	good	3
	very good	2
Thu	very bad	1
	bad	0
	medium	0
	good	0
	very good	0
Fri	very bad	0
	bad	0
	medium	0
	good	0
	very good	0
Sat	very bad	0
	bad	0
	medium	0
	good	2
	very good	0
Sun	very bad	0
	bad	0
	medium	0
	good	0
	very good	0

The result includes every combination of categories from both columns (7 days × 5 grades = 35 rows), even combinations that never occur in the data. Those rows have a sum of 0.
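To see how much the observed parameter changes the size of a multi-column result, the following sketch groups the same data both ways and compares the row counts. The variable names full and seen are illustrative.

```python
import pandas as pd

# Same data as in the example above
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})

# All combinations of categories (Cartesian product of both columns)
full = df.groupby(["day", "grades"], observed=False)["workers"].sum()

# Only the combinations that actually occur in the data
seen = df.groupby(["day", "grades"], observed=True)["workers"].sum()

print(len(full), "rows with observed=False")
print(len(seen), "rows with observed=True")
```

With many categorical columns or many categories, the full product grows quickly, so observed=True is often the more practical choice when the empty combinations are not needed.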

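Example: Grouping Multiple Categorical Columns with observed=True

The following example groups data by three categorical columns. Setting observed=True keeps only the combinations that occur in the data, and as_index=False returns the group keys as regular columns instead of an index.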

import pandas as pd

# Define grouping columns
group_cols = ['Cat_col1', 'Cat_col2', 'Cat_col3']

# Create DataFrame
df = pd.DataFrame([
    ['A', 'B', 'C', 5.4],
    ['A', 'B', 'D', 11.23],
    ['B', 'A', 'C', 94.12],
    ['B', 'A', 'A', 165.2],
    ['A', 'B', 'D', 565.6]],
    columns=(group_cols + ['Value']))

# Convert columns to categorical types
for col in group_cols:
    df[col] = df[col].astype('category')

# Grouping the data
result = df.groupby(group_cols, as_index=False, observed=True).sum()

# Display the Result
print(result)

When we run the above program, it produces the following result:

	Cat_col1	Cat_col2	Cat_col3	Value
0	A	B	C	5.40
1	A	B	D	576.83
2	B	A	A	165.20
3	B	A	C	94.12
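For comparison, the sketch below runs the same grouping with the default as_index=True, which places the group keys in a MultiIndex rather than in columns. The variable names flat and indexed are illustrative.

```python
import pandas as pd

# Same data as in the example above
group_cols = ['Cat_col1', 'Cat_col2', 'Cat_col3']
df = pd.DataFrame([
    ['A', 'B', 'C', 5.4],
    ['A', 'B', 'D', 11.23],
    ['B', 'A', 'C', 94.12],
    ['B', 'A', 'A', 165.2],
    ['A', 'B', 'D', 565.6]],
    columns=(group_cols + ['Value']))
for col in group_cols:
    df[col] = df[col].astype('category')

# Group keys returned as ordinary columns
flat = df.groupby(group_cols, as_index=False, observed=True).sum()

# Group keys returned as a three-level MultiIndex
indexed = df.groupby(group_cols, as_index=True, observed=True).sum()

print(flat.columns.tolist())
print(indexed.index.names)
```

The flat form is convenient for further filtering or merging on the key columns, while the indexed form suits label-based lookups by group.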
