- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
Physics
Chemistry
Biology
Mathematics
English
Economics
Psychology
Social Studies
Fashion Studies
Legal Studies
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to add group-level summary statistics as a new column in Pandas?
Pandas is an extremely popular data handling library used frequently for data manipulation and analysis. The Pandas library offers powerful features for analysis such as grouping to analyze various samples having some common features. In this article, we are going to learn how to add these summary statistics obtained through groups of samples as a new column in our existing Pandas dataframes.
NOTE − The code in this article was run on a jupyter notebook.
Let's begin by importing Pandas.
import pandas as pd
Example
Following is the sample d ataset we will work on. It has 3 columns storing the name of the student, subject name and the respective score of a student in that subject. We will find out various statistics for the scores of each student.
data = {'name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Bob', 'Charlie', 'Charlie', 'Charlie', 'Charlie'], 'subject': ['English', 'Science', 'Maths', 'History', 'English', 'Science', 'Maths', 'History', 'English', 'Science', 'Maths', 'History'], 'score': [87, 92, 78, 65, 76, 89, 91, 81, 90, 85, 88, 93]} df = pd.DataFrame(data) df
Output
name subject score 0 Alice English 87 1 Alice Science 92 2 Alice Maths 78 3 Alice History 65 4 Bob English 76 5 Bob Science 89 6 Bob Maths 91 7 Bob History 81 8 Charlie English 90 9 Charlie Science 85 10 Charlie Maths 88 11 Charlie History 93
We will first group the dataset according to the names of the students as follows.
df_grp_name = df.groupby('name') df_grp_name
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019F5C3889D0>
Here `df_grp_name` is a pandas.DataFrame.groupby object returned by the grouping operation we did on the `name` column of `df`. Now we will add a new column to store the maximum scores of each student as follows.
df['max_score'] = df_grp_name['score'].transform('max') df
name subject score max_score 0 Alice English 87 92 1 Alice Science 92 92 2 Alice Maths 78 92 3 Alice History 65 92 4 Bob English 76 91 5 Bob Science 89 91 6 Bob Maths 91 91 7 Bob History 81 91 8 Charlie English 90 93 9 Charlie Science 85 93 10 Charlie Maths 88 93 11 Charlie History 93 93
Here we first access the `score` column of our dataframe and apply a `max` transformation.
Example
We can also do this entire thing in a single line as follows −
df['max_score_2'] = df.groupby('name')['score'].transform('max') df
Output
name subject score max_score max_score_2 0 Alice English 87 92 92 1 Alice Science 92 92 92 2 Alice Maths 78 92 92 3 Alice History 65 92 92 4 Bob English 76 91 91 5 Bob Science 89 91 91 6 Bob Maths 91 91 91 7 Bob History 81 91 91 8 Charlie English 90 93 93 9 Charlie Science 85 93 93 10 Charlie Maths 88 93 93 11 Charlie History 93 93 93
NOTE − We can also access columns using the dot notation like below.
df['max_score_3'] = df.groupby('name').score.transform('max') df
name subject score max_score max_score_2 max_score_3 0 Alice English 87 92 92 92 1 Alice Science 92 92 92 92 2 Alice Maths 78 92 92 92 3 Alice History 65 92 92 92 4 Bob English 76 91 91 91 5 Bob Science 89 91 91 91 6 Bob Maths 91 91 91 91 7 Bob History 81 91 91 91 8 Charlie English 90 93 93 93 9 Charlie Science 85 93 93 93 10 Charlie Maths 88 93 93 93 11 Charlie History 93 93 93 93
Example
Above is a great way to add a single or a few statistical values. However it can quickly get tedious so let's see how to add multiple statistics in one go! Let's now recreate our original dataset.
df = pd.DataFrame(data)
Let's find out the different statistical values we want to find as shown below.
df_agg = df.groupby(['name'])['score'].agg([min, max]) df_agg
Output
name min max Alice 65 92 Bob 76 91 Charlie 85 93
Here we first group `df` by the column `name` and aggregate different `min` and `max` values. We can see we have a new dataframe which stores all the aggregated values.
Example
Now we will do a 'join' operation on our original dataframe and on this one to incorporate the statistical summaries. We can do it as follows −
df = pd.merge(df, df_agg, on='name', how='left') df
Output
name subject score min max 0 Alice English 87 65 92 1 Alice Science 92 65 92 2 Alice Maths 78 65 92 3 Alice History 65 65 92 4 Bob English 76 76 91 5 Bob Science 89 76 91 6 Bob Maths 91 76 91 7 Bob History 81 76 91 8 Charlie English 90 85 93 9 Charlie Science 85 85 93 10 Charlie Maths 88 85 93 11 Charlie History 93 85 93
Here we have used the Pandas `merge` method and we are joining on the column `name` (i.e. the rows are being matched based on this column) and we have specified the type join as a left join.
Conclusion
This article taught us several ways on how to add summary statistics as a new column to our Pandas dataframe. We saw how to add a single as well as multiple statistics in one go. You can now use what you just learned in your projects and different applications.