Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to select subset of data with Index Labels in Python Pandas?
Pandas provides powerful selection capabilities to extract subsets of data using either index positions or index labels. This article demonstrates how to select data using index labels with the .loc accessor.
The .loc attribute works similar to Python dictionaries, selecting data by index labels rather than positions. This is different from .iloc which selects by integer position like Python lists.
Setting Up the Dataset
Let's start by importing a movies dataset with the title as the index ?
import pandas as pd
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
index_col="title",
usecols=["title", "budget", "vote_average", "vote_count"])
# Sort the index for better performance with string indices
movies.sort_index(inplace=True)
print(movies.head(3))
budget vote_average vote_count
title
#Horror 1500000 3.3 52
(500) Days of Summer 7500000 7.2 2904
10 Cloverfield Lane 15000000 6.8 2468
Selecting Single Rows by Label
Use square brackets [] with .loc to select a single row by its index label ?
# Extract data for "Spider-Man 3" print(movies.loc["Spider-Man 3"])
budget 2.580000e+08 vote_average 5.900000e+00 vote_count 3.576000e+03 Name: Spider-Man 3, dtype: float64
Selecting Ranges with Slicing
Use slicing to extract multiple rows between two labels. Unlike Python lists, string label slicing is inclusive of both endpoints ?
# Select all movies from "Alien" to "Avatar" (inclusive)
result = movies.loc["Alien":"Avatar"]
print(f"Selected {len(result)} movies")
print(result.head())
Selected 167 movies
budget vote_average vote_count
title
Alien 11000000 7.9 4470
Alien Zone 0 4.0 3
Alien: Resurrection 70000000 5.9 1365
Aliens 18500000 7.7 3220
Aliens in the Attic 45000000 5.3 244
Selecting Multiple Non-Consecutive Rows
Pass a list of labels to select specific movies that aren't adjacent ?
# Select specific movies by passing a list selected_movies = movies.loc[["Avatar", "Avengers: Age of Ultron"]] print(selected_movies)
budget vote_average vote_count
title
Avatar 237000000 7.2 11800
Avengers: Age of Ultron 280000000 7.3 6767
Handling Missing Labels
Modern pandas raises a KeyError when you try to select non-existent labels. You can check if a label exists first ?
# Check if a movie exists in the index
movie_exists = "When is Avengers next movie?" in movies.index
print(f"Movie exists: {movie_exists}")
# Alternative approach using query() to ignore missing labels
result = movies.query("index in ['Avatar', 'When is Avengers next Movie?']")
print(result)
Movie exists: False
budget vote_average vote_count
title
Avatar 237000000 7.2 11800
Key Points
| Feature | Behavior | Example |
|---|---|---|
| Single selection | Returns Series | .loc["Avatar"] |
| List selection | Returns DataFrame | .loc[["Avatar", "Alien"]] |
| Slice selection | Inclusive of endpoints | .loc["Alien":"Avatar"] |
| Missing labels | Raises KeyError | Check with in movies.index
|
Conclusion
The .loc accessor provides flexible label-based selection in pandas. Always sort string indices for better performance, and remember that label slicing is inclusive of both endpoints unlike Python lists.
