Python Pandas - GroupBy with MultiIndex



MultiIndexed data in Pandas supports more complex indexing through multiple index levels. Such multi-level data is particularly useful for representing higher-dimensional data in a two-dimensional format, and its hierarchical structure lets you group data at different levels.

The Pandas groupby() method works with MultiIndex data for aggregation and analysis. With hierarchical (MultiIndex) data, groupby() becomes even more flexible, because the data can be grouped by any level of the index.

In this tutorial, we will learn how to use the GroupBy functionality in Pandas with a MultiIndex DataFrame or Series.

Grouping by Index Levels

To group the data by one of the levels in the MultiIndex, we can use the level parameter in the groupby() method. This allows us to specify which level we want to group by, either by its number (0-based index) or by its name, if names have been assigned to the levels.
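
As a minimal sketch of the two ways to refer to a level, the snippet below builds a small Series with fixed values and checks that grouping by position and grouping by name give the same result. The data and the variable names by_position and by_name are illustrative assumptions, not part of the tutorial's own examples.

```python
import pandas as pd

# Small MultiIndex with named levels; fixed values keep the result reproducible
index = pd.MultiIndex.from_arrays(
    [["BMW", "BMW", "Audi", "Audi"], ["1", "2", "1", "2"]],
    names=["first", "second"],
)
s = pd.Series([10, 20, 30, 40], index=index)

# Refer to the outer level by its position (0-based)
by_position = s.groupby(level=0).sum()

# Refer to the same level by the name assigned to it
by_name = s.groupby(level="first").sum()

print(by_name)
print(by_position.equals(by_name))
```

Using the level name is usually easier to read and keeps working if the order of the levels changes later.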

Example: Grouping by First Index Level

Here is an example of grouping the MultiIndexed Series object by its first index level.

import pandas as pd
import numpy as np

# Create a 2D list
list_2d = [["BMW", "BMW", "Lexus", "Lexus", "foo", "foo", "Audi", "Audi"],
["1", "2", "1", "2", "1", "2", "1", "2"]]

# Create a MultiIndex object
index = pd.MultiIndex.from_arrays(list_2d, names=["first", "second"])

# Creating a MultiIndexed Series
s = pd.Series(np.random.randn(8), index=index)

# Display the input MultiIndexed Series 
print("Input MultiIndexed Series:\n",s)

# Group the Series by the first index level
grouped = s.groupby(level=0)

print("Output Summary of the grouped data:")
print(grouped.sum())

Following is the output of the above code (the values are generated randomly, so they will differ on each run) −

Input MultiIndexed Series:
first  second
BMW    1        -0.795467
       2        -0.132035
Lexus  1        -0.913917
       2        -0.875364
foo    1         0.004405
       2        -0.336840
Audi   1        -0.513719
       2         0.588359
dtype: float64
Output Summary of the grouped data:
first
Audi     0.074640
BMW     -0.927503
Lexus   -1.789281
foo     -0.332435
dtype: float64

Grouping by Second Index Level

As with the first index level, we can group the data by its second index level. To do this, pass either the level name or its position, 1, to the level parameter.
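
To make the effect of grouping by the inner level easy to check by hand, here is a small sketch that uses fixed integers instead of random numbers. The data and the name totals are assumptions chosen for illustration.

```python
import pandas as pd

# Three outer labels, each with inner labels "1" and "2"
index = pd.MultiIndex.from_arrays(
    [["BMW", "BMW", "Lexus", "Lexus", "Audi", "Audi"],
     ["1", "2", "1", "2", "1", "2"]],
    names=["first", "second"],
)
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# Rows that share the same "second" label are summed across all outer labels:
# "1" collects 1 + 3 + 5, and "2" collects 2 + 4 + 6
totals = s.groupby(level="second").sum()

print(totals)
```

Each group here spans every value of the first level, so the result has one row per distinct label in the second level.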

Example

The following example demonstrates grouping the MultiIndex Series object by its second index level.

import pandas as pd
import numpy as np

# Create a 2D list
list_2d = [["BMW", "BMW", "Lexus", "Lexus", "foo", "foo", "Audi", "Audi"],
["1", "2", "1", "2", "1", "2", "1", "2"]]

# Create a MultiIndex object
index = pd.MultiIndex.from_arrays(list_2d, names=["first", "second"])

# Creating a MultiIndexed Series
s = pd.Series(np.random.randn(8), index=index)

# Display the input MultiIndexed Series 
print("Input MultiIndexed Series:\n",s)

# Group the Series by the second index level
grouped = s.groupby(level="second")

print("Output Summary of the grouped data:")
print(grouped.sum())

Following is the output of the above code −

Input MultiIndexed Series:
first  second
BMW    1         1.046440
       2        -0.895963
Lexus  1        -0.292579
       2        -0.009580
foo    1         0.004405
       2         1.279683
Audi   1         0.513284
       2        -0.250846
dtype: float64
Output Summary of the grouped data:
second
1    1.271550
2    0.123295
dtype: float64

Grouping by Multiple Index Levels

Pandas also allows you to group MultiIndex data by more than one index level by passing a list of level names or positions to the level parameter of the groupby() method.
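
The following sketch, using made-up fixed values, shows that a list of level names and a list of level positions select the same groups. The names by_names and by_positions are hypothetical and used only for this illustration.

```python
import pandas as pd

# Three-level index; the "second" level is ignored by the grouping below
index = pd.MultiIndex.from_arrays(
    [
        ["BMW", "BMW", "BMW", "Audi", "Audi", "Audi"],
        ["1", "2", "3", "1", "2", "3"],
        ["red", "black", "red", "red", "black", "red"],
    ],
    names=["first", "second", "third"],
)
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# Group by the first and third levels, given as names
by_names = s.groupby(level=["first", "third"]).sum()

# The same grouping, given as 0-based positions
by_positions = s.groupby(level=[0, 2]).sum()

print(by_names)
print(by_names.equals(by_positions))
```

The result is itself indexed by a MultiIndex built from the levels that were grouped on, so a single group can be looked up with a tuple such as ("Audi", "red").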

Example

This example groups the MultiIndexed Series object by its first and third index levels.

import pandas as pd
import numpy as np

# Create data for multi index
data = [["BMW", "BMW", "Lexus", "Lexus", "foo", "foo", "Audi", "Audi"],
["1", "2", "1", "2", "1", "2", "1", "2"], 
['red', 'black', 'red', 'black', 'red', 'black', 'red', 'black']]

# Create a MultiIndex object
index = pd.MultiIndex.from_arrays(data, names=["first", "second", "third"])

# Creating a MultiIndexed Series
s = pd.Series(np.random.randn(8), index=index)

# Display the input MultiIndexed Series 
print("Input MultiIndexed Series:\n",s)

# Group the Series by the first and third index levels
grouped = s.groupby(level=["first", "third"])

print("Output Summary of the grouped data:")
print(grouped.sum())

Following is the output of the above code −

Input MultiIndexed Series:
first  second  third
BMW    1       red      0.681079
       2       black    0.103199
Lexus  1       red     -1.177623
       2       black   -1.069462
foo    1       red      1.015916
       2       black   -0.548004
Audi   1       red      0.646248
       2       black   -1.130859
dtype: float64
Output Summary of the grouped data:
first  third
Audi   black   -1.130859
       red      0.646248
BMW    black    0.103199
       red      0.681079
Lexus  black   -1.069462
       red     -1.177623
foo    black   -0.548004
       red      1.015916
dtype: float64

Grouping DataFrame with Index Levels and Columns

A Pandas DataFrame can also be grouped by a combination of index levels and columns. This adds more flexibility in grouping operations, allowing you to aggregate data based on both row indices and column values.

Example

The following example demonstrates grouping the MultiIndexed DataFrame by its index level and column values.

import pandas as pd
import numpy as np

# Create a 2D list
list_2d = [["BMW", "BMW", "Lexus", "Lexus", "foo", "foo", "Audi", "Audi"],
["1", "2", "1", "2", "1", "2", "1", "2"]]

# Create a MultiIndex object
index = pd.MultiIndex.from_arrays(list_2d, names=["first", "second"])

# Creating a MultiIndexed DataFrame
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

# Display the input MultiIndexed DataFrame
print("Input MultiIndexed DataFrame:\n")
print(df)

# Group the DataFrame by the second index level and the A column
grouped = df.groupby([pd.Grouper(level=1), "A"])

print("Output Summary of the grouped data:")
print(grouped.sum())

Following is the output of the above code −

Input MultiIndexed DataFrame:
              A  B
first second
BMW   1       1  0
      2       1  1
Lexus 1       1  2
      2       1  3
foo   1       2  4
      2       2  5
Audi  1       3  6
      2       3  7
Output Summary of the grouped data:
          B
second A
1      1  2
       2  4
       3  6
2      1  4
       2  5
       3  7
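
As a possible shorthand, Pandas also accepts index level names directly as groupby() keys alongside column names, so pd.Grouper is not strictly needed when the levels are named. The sketch below uses a small made-up DataFrame; the names with_grouper and with_names are assumptions for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["BMW", "BMW", "Lexus", "Lexus"], ["1", "2", "1", "2"]],
    names=["first", "second"],
)
df = pd.DataFrame({"A": [1, 1, 1, 2], "B": [10, 20, 30, 40]}, index=index)

# Explicit form: a Grouper for the index level plus a column name
with_grouper = df.groupby([pd.Grouper(level="second"), "A"]).sum()

# Shorthand: "second" is resolved as an index level, "A" as a column
with_names = df.groupby(["second", "A"]).sum()

print(with_names)
print(with_grouper.equals(with_names))
```

The shorthand only works when a key matches exactly one of an index level name or a column name. If the same name is used for both, Pandas raises an error about the ambiguity, and pd.Grouper(level=...) makes the intent explicit.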