How to select the largest of each group in Python Pandas DataFrame?

When analyzing data, you often need to find the row with the largest value in each group. This tutorial shows how to select the most popular movie for each year from a movies dataset using Python Pandas.

Preparing the Dataset

Let's start by loading a movies dataset and examining its structure ?

import pandas as pd
import numpy as np

# Load movies dataset
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

# Display sample rows
print("Sample data:")
print(movies.sample(n=3))
Sample data:
    budget      id original_language original_title  popularity release_date  \
235  22000000  235260                en       Son of God    9.175762   28/02/2014   
1548        0   13411                en  Malibu's Most Wanted    7.314796   10/04/2003   
3279  8000000   26306                en      Prefontaine    8.717235   24/01/1997   

      revenue  runtime    status             title  vote_average  vote_count
235   67800064    138.0  Released       Son of God           5.9          83
1548         0     86.0  Released  Malibu's Most Wanted           4.7          77
3279    589304    106.0  Released        Prefontaine           6.7          21

Data Preprocessing

First, let's extract the year from the release date to group movies by year ?

# Create year column from release_date
movies['year'] = pd.to_datetime(movies['release_date']).dt.year

# Check data types
print("Data types:")
print(movies[['title', 'year', 'popularity']].dtypes)
Data types:
title         object
year           int64
popularity   float64
dtype: object

Method 1: Using sort_values() and drop_duplicates()

Sort by year and popularity, then keep only the first (highest) entry for each year ?

# Select relevant columns and sort by year and popularity
result_method1 = (movies[["title", "year", "popularity"]]
                 .sort_values(["year", "popularity"], ascending=[True, False])
                 .drop_duplicates(subset="year"))

print("Top 5 most popular movies by year (Method 1):")
print(result_method1.head())
Top 5 most popular movies by year (Method 1):
                    title  year  popularity
4592           Intolerance  1916    3.232447
4661         The Big Parade  1925    0.785744
2638            Metropolis  1927   32.351527
4457         Pandora's Box  1929    1.824184
3804          Hell's Angels  1930    8.484123

Method 2: Using groupby() with idxmax()

Group by year and find the index of the maximum popularity value ?

# Group by year and get index of maximum popularity
idx = movies.groupby('year')['popularity'].idxmax()
result_method2 = movies.loc[idx, ['title', 'year', 'popularity']]

print("Top 5 most popular movies by year (Method 2):")
print(result_method2.head())
Top 5 most popular movies by year (Method 2):
                    title  year  popularity
4592           Intolerance  1916    3.232447
4661         The Big Parade  1925    0.785744
2638            Metropolis  1927   32.351527
4457         Pandora's Box  1929    1.824184
3804          Hell's Angels  1930    8.484123

Method 3: Using groupby() with apply()

Apply a custom function to each group to get the row with maximum popularity ?

# Group by year and apply function to get max popularity row
result_method3 = (movies[["title", "year", "popularity"]]
                 .groupby("year")
                 .apply(lambda df: df.loc[df['popularity'].idxmax()])
                 .reset_index(drop=True))

print("Top 5 most popular movies by year (Method 3):")
print(result_method3.head())
Top 5 most popular movies by year (Method 3):
                    title  year  popularity
0              Intolerance  1916    3.232447
1           The Big Parade  1925    0.785744
2               Metropolis  1927   32.351527
3            Pandora's Box  1929    1.824184
4            Hell's Angels  1930    8.484123

Comparison of Methods

Method Performance Memory Usage Best For
sort_values() + drop_duplicates() Medium High Simple cases
groupby() + idxmax() Fast Low Large datasets
groupby() + apply() Slow Medium Complex operations

Viewing Recent Popular Movies

Let's see the most popular movies from recent years ?

# Get most popular movies from 2010 onwards
recent_popular = result_method2[result_method2['year'] >= 2010].sort_values('year', ascending=False)
print("Most popular movies by year (2010+):")
print(recent_popular.head(10))
Most popular movies by year (2010+):
                              title  year  popularity
4255                Growing Up Smith  2017    0.710870
788                         Deadpool  2016  514.569956
546                          Minions  2015  875.581305
95                      Interstellar  2014  724.247784
124                           Frozen  2013  165.125366
2577                        Skyfall  2012  271.230817
1313  Harry Potter and the Deathly...  2011  321.795184
2913                    Toy Story 3  2010  238.172433

Conclusion

Use groupby().idxmax() for the most efficient way to find the largest value in each group. The sort_values() + drop_duplicates() approach works well for smaller datasets and is more intuitive to understand.

Updated on: 2026-03-25T11:55:57+05:30

743 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements