Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to select the largest of each group in Python Pandas DataFrame?
When analyzing data, you often need to find the row with the largest value in each group. This tutorial shows how to select the most popular movie for each year from a movies dataset using Python Pandas.
Preparing the Dataset
Let's start by loading a movies dataset and examining its structure ?
import pandas as pd
import numpy as np
# Load movies dataset
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
# Display sample rows
print("Sample data:")
print(movies.sample(n=3))
Sample data:
budget id original_language original_title popularity release_date \
235 22000000 235260 en Son of God 9.175762 28/02/2014
1548 0 13411 en Malibu's Most Wanted 7.314796 10/04/2003
3279 8000000 26306 en Prefontaine 8.717235 24/01/1997
revenue runtime status title vote_average vote_count
235 67800064 138.0 Released Son of God 5.9 83
1548 0 86.0 Released Malibu's Most Wanted 4.7 77
3279 589304 106.0 Released Prefontaine 6.7 21
Data Preprocessing
First, let's extract the year from the release date to group movies by year ?
# Create year column from release_date
movies['year'] = pd.to_datetime(movies['release_date']).dt.year
# Check data types
print("Data types:")
print(movies[['title', 'year', 'popularity']].dtypes)
Data types: title object year int64 popularity float64 dtype: object
Method 1: Using sort_values() and drop_duplicates()
Sort by year and popularity, then keep only the first (highest) entry for each year ?
# Select relevant columns and sort by year and popularity
result_method1 = (movies[["title", "year", "popularity"]]
.sort_values(["year", "popularity"], ascending=[True, False])
.drop_duplicates(subset="year"))
print("Top 5 most popular movies by year (Method 1):")
print(result_method1.head())
Top 5 most popular movies by year (Method 1):
title year popularity
4592 Intolerance 1916 3.232447
4661 The Big Parade 1925 0.785744
2638 Metropolis 1927 32.351527
4457 Pandora's Box 1929 1.824184
3804 Hell's Angels 1930 8.484123
Method 2: Using groupby() with idxmax()
Group by year and find the index of the maximum popularity value ?
# Group by year and get index of maximum popularity
idx = movies.groupby('year')['popularity'].idxmax()
result_method2 = movies.loc[idx, ['title', 'year', 'popularity']]
print("Top 5 most popular movies by year (Method 2):")
print(result_method2.head())
Top 5 most popular movies by year (Method 2):
title year popularity
4592 Intolerance 1916 3.232447
4661 The Big Parade 1925 0.785744
2638 Metropolis 1927 32.351527
4457 Pandora's Box 1929 1.824184
3804 Hell's Angels 1930 8.484123
Method 3: Using groupby() with apply()
Apply a custom function to each group to get the row with maximum popularity ?
# Group by year and apply function to get max popularity row
result_method3 = (movies[["title", "year", "popularity"]]
.groupby("year")
.apply(lambda df: df.loc[df['popularity'].idxmax()])
.reset_index(drop=True))
print("Top 5 most popular movies by year (Method 3):")
print(result_method3.head())
Top 5 most popular movies by year (Method 3):
title year popularity
0 Intolerance 1916 3.232447
1 The Big Parade 1925 0.785744
2 Metropolis 1927 32.351527
3 Pandora's Box 1929 1.824184
4 Hell's Angels 1930 8.484123
Comparison of Methods
| Method | Performance | Memory Usage | Best For |
|---|---|---|---|
| sort_values() + drop_duplicates() | Medium | High | Simple cases |
| groupby() + idxmax() | Fast | Low | Large datasets |
| groupby() + apply() | Slow | Medium | Complex operations |
Viewing Recent Popular Movies
Let's see the most popular movies from recent years ?
# Get most popular movies from 2010 onwards
recent_popular = result_method2[result_method2['year'] >= 2010].sort_values('year', ascending=False)
print("Most popular movies by year (2010+):")
print(recent_popular.head(10))
Most popular movies by year (2010+):
title year popularity
4255 Growing Up Smith 2017 0.710870
788 Deadpool 2016 514.569956
546 Minions 2015 875.581305
95 Interstellar 2014 724.247784
124 Frozen 2013 165.125366
2577 Skyfall 2012 271.230817
1313 Harry Potter and the Deathly... 2011 321.795184
2913 Toy Story 3 2010 238.172433
Conclusion
Use groupby().idxmax() for the most efficient way to find the largest value in each group. The sort_values() + drop_duplicates() approach works well for smaller datasets and is more intuitive to understand.
