Python Pandas - GroupBy
The pandas groupby() method is an essential tool for data aggregation and analysis in Python. It follows the "Split-Apply-Combine" pattern, which lets users −
Split data into groups based on specific criteria.
Apply functions independently to each group.
Combine the results into a structured format.
In this tutorial, we will learn the basics of groupby operations in pandas, such as splitting data, viewing groups, and selecting specific groups, using an example dataset.
Introduction to GroupBy Operations
Every groupby() operation involves three key steps: splitting the data into groups based on some criteria, applying a function independently to each group, and combining the results into a meaningful structure.
In most cases, a function is applied to each of the split groups. The apply step can perform the following operations −
Aggregation: Computing summary statistics like mean, sum, etc.
Transformation: Applying a function to transform data.
Filtration: Removing groups based on some condition.
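The three kinds of apply step listed above can be sketched on a small DataFrame. This is a minimal illustration rather than part of the tutorial's dataset walkthrough: the five-row frame and the variable names (total_points, diff_from_mean, at_least_two) are assumptions chosen for the example.

```python
import pandas as pd

# Small illustrative dataset (assumed for this sketch).
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Royals'],
    'Points': [876, 789, 863, 673, 701],
})

grouped = df.groupby('Team')

# Aggregation: one summary value per group.
total_points = grouped['Points'].sum()

# Transformation: a result with the same length as the input,
# here each row's difference from its own team's mean.
diff_from_mean = df['Points'] - grouped['Points'].transform('mean')

# Filtration: keep only the rows of groups that satisfy a condition,
# here teams that appear at least twice.
at_least_two = grouped.filter(lambda g: len(g) >= 2)

print(total_points)
print(diff_from_mean)
print(at_least_two)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.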
Split Data into Groups
Pandas objects can be split into groups based on any of their column values using the groupby() method.
Example
The following example creates a DataFrame and groups it by the Team column using the groupby() method.
# import the pandas library
import pandas as pd
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
# Display the Original DataFrame
print("Original DataFrame:")
print(df)
# Display the Grouped Data
print('\nGrouped Data:')
print(df.groupby('Team'))
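Following is the output of the above code. Only the original DataFrame is reproduced here; the final print() displays just the representation of a DataFrameGroupBy object (including a memory address that varies between runs), not the grouped rows −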
Original DataFrame:
| | Team | Rank | Year | Points |
|---|---|---|---|---|
| 0 | Riders | 1 | 2014 | 876 |
| 1 | Riders | 2 | 2015 | 789 |
| 2 | Devils | 2 | 2014 | 863 |
| 3 | Devils | 3 | 2015 | 673 |
| 4 | Kings | 3 | 2014 | 741 |
| 5 | kings | 4 | 2015 | 812 |
| 6 | Kings | 1 | 2016 | 756 |
| 7 | Kings | 1 | 2017 | 788 |
| 8 | Riders | 2 | 2016 | 694 |
| 9 | Royals | 4 | 2014 | 701 |
| 10 | Royals | 1 | 2015 | 804 |
| 11 | Riders | 2 | 2017 | 690 |
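
Because printing a groupby object does not show its contents, one way to inspect the groups is to iterate over the object. The sketch below assumes a small four-row DataFrame and a hypothetical group_sizes dictionary; both are illustrative only.

```python
import pandas as pd

# Small illustrative dataset (assumed for this sketch).
df = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Riders', 'Royals'],
    'Points': [876, 863, 789, 701],
})

grouped = df.groupby('Team')

# Iterating yields (group key, sub-DataFrame) pairs,
# sorted by key by default.
group_sizes = {}
for name, group in grouped:
    group_sizes[name] = len(group)
    print(name)
    print(group)

# ngroups reports how many distinct groups were formed.
print(grouped.ngroups)
```

Each sub-DataFrame keeps the original index labels, so rows can still be traced back to the source data.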
GroupBy with Multiple Columns
You can group data by multiple columns by passing a list of column names to the groupby() method. Each group is then identified by a tuple containing one value per column.
Example
Here is an example where the data is grouped by multiple columns.
# import the pandas library
import pandas as pd
# Create a DataFrame
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
# Display the Grouped Data
print('Grouped Data:')
print(df.groupby(['Team','Year']).groups)
Its output is as follows −
Grouped Data:
{('Devils', 2014): [2], ('Devils', 2015): [3], ('Kings', 2014): [4], ('Kings', 2016): [6], ('Kings', 2017): [7], ('Riders', 2014): [0], ('Riders', 2015): [1], ('Riders', 2016): [8], ('Riders', 2017): [11], ('Royals', 2014): [9], ('Royals', 2015): [10], ('kings', 2015): [5]}
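Aggregating a multi-column grouping produces a result indexed by the same key tuples shown above. The following sketch assumes a reduced five-row DataFrame and the illustrative names by_team_year, points_sum, and riders_2014.

```python
import pandas as pd

# Small illustrative dataset (assumed for this sketch).
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Riders'],
    'Year': [2014, 2015, 2014, 2015, 2014],
    'Points': [876, 789, 863, 673, 694],
})

by_team_year = df.groupby(['Team', 'Year'])

# The aggregated Series is indexed by a (Team, Year) MultiIndex.
points_sum = by_team_year['Points'].sum()
print(points_sum)

# A single entry is addressed with a tuple of key values.
riders_2014 = points_sum.loc[('Riders', 2014)]
print(riders_2014)
```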
Viewing Grouped Data
Once your data is split into groups, you can inspect the groups in several ways. One of the simplest is the .groups attribute, a dictionary that maps each group key to the index labels of the rows in that group.
Example
The following example demonstrates how to view the grouped data using the .groups attribute.
# import the pandas library
import pandas as pd
# Create DataFrame
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
print('Viewing Grouped Data:')
print(df.groupby('Team').groups)
Its output is as follows −
Viewing Grouped Data:
{'Devils': [2, 3], 'Kings': [4, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals': [9, 10], 'kings': [5]}
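Note that 'Kings' and 'kings' appear as separate groups in the output above, because group keys are compared exactly and string comparison is case-sensitive. If that split is unintended, one possible approach is to group on a normalized copy of the column. The sketch below assumes a four-row DataFrame and uses str.title() as the normalization; both choices are illustrative.

```python
import pandas as pd

# Small illustrative dataset (assumed for this sketch).
df = pd.DataFrame({
    'Team': ['Kings', 'kings', 'Kings', 'Riders'],
    'Points': [741, 812, 756, 876],
})

# Grouping on the raw column keeps 'Kings' and 'kings' apart.
raw_keys = sorted(df.groupby('Team').groups)
print(raw_keys)

# Grouping on a normalized copy of the column merges them;
# the original DataFrame is left unchanged.
normalized = df['Team'].str.title()
merged_groups = df.groupby(normalized).groups
merged_keys = sorted(merged_groups)
print(merged_keys)
```

Passing a Series to groupby() groups the rows by that Series' values, so the stored Team column does not have to be modified.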
Selecting a Specific Group
Using the get_group() method, we can select a specific group.
Example
The following example demonstrates selecting a single group from grouped data using the get_group() method.
# import the pandas library
import pandas as pd
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
# Display the Selected Data
print('Selected Group Data:')
print(grouped.get_group(2014))
Its output is as follows −
Selected Group Data:
| | Team | Rank | Year | Points |
|---|---|---|---|---|
| 0 | Riders | 1 | 2014 | 876 |
| 2 | Devils | 2 | 2014 | 863 |
| 4 | Kings | 3 | 2014 | 741 |
| 9 | Royals | 4 | 2014 | 701 |
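
get_group() raises a KeyError when the requested key does not match any group. The sketch below shows one way to handle that case; get_group_or_empty is a hypothetical helper written for this illustration, not a pandas function, and the three-row DataFrame is assumed.

```python
import pandas as pd

# Small illustrative dataset (assumed for this sketch).
df = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Riders'],
    'Year': [2014, 2014, 2015],
    'Points': [876, 863, 789],
})


def get_group_or_empty(frame, by, key):
    """Hypothetical helper: return the rows of one group,
    or an empty DataFrame with the same columns if the key is absent."""
    grouped = frame.groupby(by)
    try:
        return grouped.get_group(key)
    except KeyError:
        # No group has this key: return zero rows, same columns.
        return frame.iloc[0:0]


found = get_group_or_empty(df, 'Year', 2014)
missing = get_group_or_empty(df, 'Year', 2020)
print(found)
print(missing)
```

Returning an empty DataFrame keeps downstream code uniform; letting the KeyError propagate is equally reasonable when a missing group indicates a real problem.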