Python Pandas - Grouping Categorical Data
Grouping categorical data in Pandas is a useful technique for summarizing and analyzing datasets. Categorical variables in Pandas are often represented by the Categorical type, which efficiently handles variables with a limited set of possible values (categories or labels). Examples include days of the week, gender ("male" or "female"), and product ratings on a scale such as "poor", "average", and "excellent".
In this tutorial, we will learn how to group categorical data using pandas, focusing on the effect of ordered categories and the observed parameter.
Grouping Categorical Data
The pandas groupby() method can be used to group data by a categorical column, allowing for efficient data aggregation.
Example
The following example uses the groupby() method to summarize the data by a categorical column. The observed parameter controls whether unused categories appear in the result, and the sort parameter controls whether the group keys are sorted.
import pandas as pd
# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})
# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)
# Grouping by 'day' with all categories including unused
result = df.groupby("day", observed=False, sort=True).sum()
# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)
Running the above program produces the following result:
Input Categorical DataFrame:

| | day | workers |
|---|---|---|
| 0 | Wed | 3 |
| 1 | Mon | 4 |
| 2 | Thu | 1 |
| 3 | Mon | 4 |
| 4 | Wed | 2 |
| 5 | Sat | 2 |

Grouped categorical Data:

| day | workers |
|---|---|
| Mon | 8 |
| Tue | 0 |
| Wed | 5 |
| Thu | 1 |
| Fri | 0 |
| Sat | 2 |
| Sun | 0 |

When you group by categorical variables with observed=False, Pandas includes all categories, even those that are not present in the data. In the above output, Tue, Fri, and Sun appear with a sum of 0.
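The value shown for an unused category depends on the aggregation. The sketch below reuses the same data to compare three aggregations side by side; the column names rows, total, and average are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same categorical data as in the example above
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# Keep every category, including the unused ones
grouped = df.groupby("day", observed=False)["workers"]

# Collect several aggregations into one table (column names are illustrative)
summary = pd.DataFrame({
    "rows": grouped.size(),      # number of rows per category
    "total": grouped.sum(),      # sum of an empty group is 0
    "average": grouped.mean(),   # mean of an empty group is NaN
})
print(summary)
```

For Tue, Fri, and Sun the row count and the sum are 0, while the mean is NaN because there is nothing to average.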
Grouping Only the Observed Categories
If you want to group only the observed categories, you can set observed=True. This will exclude any unused categories from the result.
Example
This example groups only the observed categories of a categorical column by passing observed=True to the groupby() method.
import pandas as pd
# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2]})
# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)
# Grouping by 'day' with only observed categories
result = df.groupby("day", observed=True).sum()
# Display the Grouped categorical Data
print('\nGrouped categorical Data:')
print(result)
Executing the above code gives the following output:
Input Categorical DataFrame:

| | day | workers |
|---|---|---|
| 0 | Wed | 3 |
| 1 | Mon | 4 |
| 2 | Thu | 1 |
| 3 | Mon | 4 |
| 4 | Wed | 2 |
| 5 | Sat | 2 |

Grouped categorical Data:

| day | workers |
|---|---|
| Mon | 8 |
| Wed | 5 |
| Thu | 1 |
| Sat | 2 |

As you can see, categories like Tue, Fri, and Sun are not included in the result because they were not observed in the dataset.
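To make the difference between the two settings explicit, the following sketch runs both groupings on the same data and lists the categories that only the observed=False result contains. The variable names all_days, seen_days, and unused are illustrative. Passing observed explicitly also avoids relying on the default, which is not the same in every pandas release.

```python
import pandas as pd

# Same categorical data as in the example above
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
df = pd.DataFrame({"day": days, "workers": [3, 4, 1, 4, 2, 2]})

# One result with every category, one with only the observed ones
all_days = df.groupby("day", observed=False)["workers"].sum()
seen_days = df.groupby("day", observed=True)["workers"].sum()

# Categories present only in the observed=False result
seen_labels = set(seen_days.index.tolist())
unused = [day for day in all_days.index.tolist() if day not in seen_labels]

print("Groups with observed=False:", len(all_days))
print("Groups with observed=True:", len(seen_days))
print("Unused categories:", unused)
```

Both results keep the groups in category order (Mon before Wed before Thu before Sat), not in the order in which the values first appear in the data.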
Grouping by Multiple Categorical Columns
You can also group data by multiple categorical columns with the groupby() method. Here the observed parameter controls whether combinations of categories that never occur in the data appear in the result.
Example
The following example demonstrates grouping the categorical data by multiple columns using the groupby() method.
import pandas as pd
# Define categorical data
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
# Creating another Categorical column for 'grade'
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])
# Create DataFrame
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})
# Display the Input Categorical DataFrame
print("Input Categorical DataFrame:")
print(df)
# Grouping Multiple Categorical Columns
result = df.groupby(["day", "grades"], observed=False).sum()
# Display the Grouped categorical Data
print('\nGrouped categorical Data by Multiple Columns:')
print(result)
The output of the above code is as follows:
Input Categorical DataFrame:

| | day | workers | grades |
|---|---|---|---|
| 0 | Wed | 3 | good |
| 1 | Mon | 4 | good |
| 2 | Thu | 1 | very bad |
| 3 | Mon | 4 | very good |
| 4 | Wed | 2 | very good |
| 5 | Sat | 2 | good |

Grouped categorical Data by Multiple Columns:

| day | grades | workers |
|---|---|---|
| Mon | very bad | 0 |
| | bad | 0 |
| | medium | 0 |
| | good | 4 |
| | very good | 4 |
| Tue | very bad | 0 |
| | bad | 0 |
| | medium | 0 |
| | good | 0 |
| | very good | 0 |
| Wed | very bad | 0 |
| | bad | 0 |
| | medium | 0 |
| | good | 3 |
| | very good | 2 |
| Thu | very bad | 1 |
| | bad | 0 |
| | medium | 0 |
| | good | 0 |
| | very good | 0 |
| Fri | very bad | 0 |
| | bad | 0 |
| | medium | 0 |
| | good | 0 |
| | very good | 0 |
| Sat | very bad | 0 |
| | bad | 0 |
| | medium | 0 |
| | good | 2 |
| | very good | 0 |
| Sun | very bad | 0 |
| | bad | 0 |
| | medium | 0 |
| | good | 0 |
| | very good | 0 |

The result includes every combination of categories from both columns (7 days × 5 grades = 35 rows), even combinations that never occur in the data. Those combinations have a sum of 0.
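The size of the result grows with the product of the category counts, so the choice of observed matters more as columns are added. A minimal sketch, reusing the same two categorical columns, compares the number of groups under each setting; the names full and seen are illustrative.

```python
import pandas as pd

# Same two categorical columns as in the example above
days = pd.Categorical(
    values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True)
grades = pd.Categorical(
    ["good", "good", "very bad", "very good", "very good", "good"],
    categories=["very bad", "bad", "medium", "good", "very good"])
df = pd.DataFrame({
    "day": days,
    "workers": [3, 4, 1, 4, 2, 2],
    "grades": grades})

# Every (day, grade) combination versus only the combinations in the data
full = df.groupby(["day", "grades"], observed=False)["workers"].sum()
seen = df.groupby(["day", "grades"], observed=True)["workers"].sum()

print("Groups with observed=False:", len(full))
print("Groups with observed=True:", len(seen))
print(seen)
```

The totals are identical in both results; observed=True only drops the empty combinations.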
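Example: Grouping Multiple Categorical Columns with observed=True
The following example groups data by three categorical columns. Setting observed=True restricts the result to the combinations that actually occur in the data, and as_index=False returns the group keys as regular columns.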
import pandas as pd
# Define grouping columns
group_cols = ['Cat_col1', 'Cat_col2', 'Cat_col3']
# Create DataFrame
df = pd.DataFrame([
    ['A', 'B', 'C', 5.4],
    ['A', 'B', 'D', 11.23],
    ['B', 'A', 'C', 94.12],
    ['B', 'A', 'A', 165.2],
    ['A', 'B', 'D', 565.6]],
    columns=(group_cols + ['Value']))
# Convert columns to categorical types
for col in group_cols:
    df[col] = df[col].astype('category')
# Grouping the data
result = df.groupby(group_cols, as_index=False, observed=True).sum()
# Display the Result
print(result)
Running the above program produces the following result:
| | Cat_col1 | Cat_col2 | Cat_col3 | Value |
|---|---|---|---|---|
| 0 | A | B | C | 5.40 |
| 1 | A | B | D | 576.83 |
| 2 | B | A | A | 165.20 |
| 3 | B | A | C | 94.12 |
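The as_index argument decides where the group keys end up. The sketch below runs the same grouping with and without as_index=False to show the two layouts; the names indexed and flat are illustrative, and the columns are converted with a dictionary passed to astype() as an alternative to the loop.

```python
import pandas as pd

group_cols = ["Cat_col1", "Cat_col2", "Cat_col3"]
df = pd.DataFrame([
    ["A", "B", "C", 5.4],
    ["A", "B", "D", 11.23],
    ["B", "A", "C", 94.12],
    ["B", "A", "A", 165.2],
    ["A", "B", "D", 565.6]],
    columns=group_cols + ["Value"])

# Convert the grouping columns to the category dtype in one call
df = df.astype({col: "category" for col in group_cols})

# Default: the group keys become a MultiIndex on the result
indexed = df.groupby(group_cols, observed=True).sum()

# as_index=False: the group keys stay as ordinary columns
flat = df.groupby(group_cols, as_index=False, observed=True).sum()

print(list(indexed.index.names))
print(list(flat.columns))
```

The flat layout is often more convenient when the result is written to a file or merged with another DataFrame, while the indexed layout suits label-based lookups.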