How to add group-level summary statistics as a new column in Pandas?


Pandas is an extremely popular data handling library used frequently for data manipulation and analysis. The Pandas library offers powerful features for analysis such as grouping to analyze various samples having some common features. In this article, we are going to learn how to add these summary statistics obtained through groups of samples as a new column in our existing Pandas dataframes.

NOTE − The code in this article was run on a jupyter notebook.

Let's begin by importing Pandas.

import pandas as pd

Example

Following is the sample d ataset we will work on. It has 3 columns storing the name of the student, subject name and the respective score of a student in that subject. We will find out various statistics for the scores of each student.

data = {'name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob',
   'Bob', 'Charlie', 'Charlie', 'Charlie', 'Charlie'],
   'subject': ['English', 'Science', 'Maths', 'History', 'English',
   'Science', 'Maths', 'History', 'English', 'Science', 'Maths', 'History'],
   'score': [87, 92, 78, 65, 76, 89, 91, 81, 90, 85, 88, 93]}
df = pd.DataFrame(data)
df

Output

   name    subject   score
0  Alice   English    87
1  Alice   Science    92
2  Alice   Maths      78
3  Alice  History     65
4  Bob    English     76
5  Bob    Science     89
6  Bob    Maths       91
7  Bob    History     81
8 Charlie  English    90
9 Charlie  Science    85
10 Charlie  Maths     88
11 Charlie  History   93

We will first group the dataset according to the names of the students as follows.

df_grp_name = df.groupby('name')
df_grp_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019F5C3889D0>

Here `df_grp_name` is a pandas.DataFrame.groupby object returned by the grouping operation we did on the `name` column of `df`. Now we will add a new column to store the maximum scores of each student as follows.

df['max_score'] = df_grp_name['score'].transform('max')
df
   name   subject   score   max_score
0  Alice   English   87      92
1  Alice   Science   92      92
2  Alice   Maths     78      92
3  Alice  History    65      92
4  Bob    English    76      91
5  Bob    Science    89      91
6  Bob     Maths     91      91
7  Bob    History    81      91
8 Charlie  English   90      93
9 Charlie  Science   85      93
10 Charlie  Maths    88      93
11 Charlie History   93      93

Here we first access the `score` column of our dataframe and apply a `max` transformation.

Example

We can also do this entire thing in a single line as follows −

df['max_score_2'] = df.groupby('name')['score'].transform('max')
df

Output

    name     subject    score   max_score  max_score_2
0   Alice    English     87        92         92
1   Alice    Science     92        92         92
2   Alice    Maths       78        92         92
3   Alice    History     65        92         92
4    Bob     English     76        91         91
5    Bob     Science     89        91         91
6    Bob      Maths      91        91         91
7    Bob     History     81        91         91
8  Charlie  English      90        93         93
9  Charlie  Science      85        93         93
10 Charlie  Maths        88        93         93
11 Charlie  History      93        93         93

NOTE − We can also access columns using the dot notation like below.

df['max_score_3'] = df.groupby('name').score.transform('max')
df
    name     subject    score   max_score  max_score_2  max_score_3
0   Alice    English     87        92         92           92                 
1   Alice    Science     92        92         92           92
2   Alice    Maths       78        92         92           92
3   Alice    History     65        92         92           92
4    Bob     English     76        91         91           91
5    Bob     Science     89        91         91           91
6    Bob      Maths      91        91         91           91
7    Bob     History     81        91         91           91
8  Charlie  English      90        93         93           93
9  Charlie  Science      85        93         93           93
10 Charlie  Maths        88        93         93           93
11 Charlie  History      93        93         93           93

Example

Above is a great way to add a single or a few statistical values. However it can quickly get tedious so let's see how to add multiple statistics in one go! Let's now recreate our original dataset.

df = pd.DataFrame(data)

Let's find out the different statistical values we want to find as shown below.

df_agg = df.groupby(['name'])['score'].agg([min, max])
df_agg

Output

name  min  max
Alice  65   92
Bob    76   91
Charlie 85  93

Here we first group `df` by the column `name` and aggregate different `min` and `max` values. We can see we have a new dataframe which stores all the aggregated values.

Example

Now we will do a 'join' operation on our original dataframe and on this one to incorporate the statistical summaries. We can do it as follows −

df = pd.merge(df, df_agg, on='name', how='left')
df

Output

    name   subject   score   min  max
0   Alice  English    87     65   92
1   Alice  Science    92     65   92
2   Alice   Maths     78     65   92
3   Alice  History    65     65   92
4    Bob   English    76     76   91
5    Bob   Science    89     76   91
6    Bob    Maths     91     76   91
7    Bob   History    81     76   91
8  Charlie  English   90     85   93
9  Charlie  Science   85     85   93
10 Charlie   Maths    88     85   93
11 Charlie  History   93     85   93

Here we have used the Pandas `merge` method and we are joining on the column `name` (i.e. the rows are being matched based on this column) and we have specified the type join as a left join.

Conclusion

This article taught us several ways on how to add summary statistics as a new column to our Pandas dataframe. We saw how to add a single as well as multiple statistics in one go. You can now use what you just learned in your projects and different applications.

Updated on: 23-Mar-2023

155 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements