How to select subset of data with Index Labels in Python Pandas?


Introduction

Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. Inthis post, I will show you how to “Select a Subset Of Data Using Index Labels” using the index label.

Remember, Python dictionaries and lists are built-in data structures that select their data either by using the index label or byindex position. A dictionary’s key must be a string, integer, or tuple while a List must either use integers (the position) or sliceobjects for selection.

Pandas have .loc and.iloc attributes available to perform index operations in their own unique ways. ). With.iloc attribute,pandas select only by position and work similarly to Python lists. The .loc attribute selects only by index label, which is similarto how Python dictionaries work.

Select a Subset Of Data Using Index Labels with .loc[]

The loc and iloc attributes are available on both Series and DataFrame

1.Import the movies dataset with the title as index.

import pandas as pd
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
index_col="title",
usecols=["title","budget","vote_average","vote_count"])

I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if youare dealing with a huge dataset when your index is sorted.

movies.sort_index(inplace = True)
movies.head(3)


     budget vote_average vote_count
title
___________________________________
#Horror 1500000 3.3 52
(500) Days of Summer 7500000 7.2 2904
10 Cloverfield Lane 15000000 6.8 2468

I have sorted the index using sort_index and "inplace = True" parameter.

1. One thing interesting about the syntax of the loc method is that it does not take parenthesis() rather takes squarebrackets[]. I think (might be wrong) this is because they wanted consistency i.e. you can use [] on a Series to extractrows, while applied on a Dataframe will fetch you the columns.

# extract "Spider-Man 3" ( I'm not a big fan of spidy)
movies.loc["Spider-Man 3"]


budget 258000000.0
vote_average 5.9
vote_count 3576.0
Name: Spider-Man 3, dtype: float64

1. Use a slice to pull out many values. I'm going to pull the movies which I haven't watched. Because this is a string labelwe are going to get all the data for our search criteria including "Avatar".

Remember - If you work with Python List the last value is excluded but since we are working with strings it's inclusive.

movies.loc["Alien":"Avatar"]


budget vote_average vote_count
title
Alien 11000000 7.9 4470
Alien Zone 0 4.0 3
Alien: Resurrection 70000000 5.9 1365
Aliens 18500000 7.7 3220
Aliens in the Attic 45000000 5.3 244
... ... ... ...
Australia 130000000 6.3 694
Auto Focus 7000000 6.1 56
Automata 7000000 5.6 670
Autumn in New York 65000000 5.7 135
Avatar 237000000 7.2 11800

167 rows × 3 columns

1. Can I get any two or more random movies that are not next to each other? Definitely yes, but you need to put more effortinto passing a list of movies you need.

What I meant was you need to have square brackets with in a square bracket.

movies.loc[["Avatar","Avengers: Age of Ultron"]]

budget vote_average vote_count
title
Avatar 237000000 7.2 11800
Avengers: Age of Ultron 280000000 7.3 6767

6. Can I change the order of selection? Of course, you can help yourself by specifying the list of labels you need in an order.

While this looks cool to specify the list of labels you want to extract, do you know what happens if you spelled a valuewrongly? Pandas would have stuck missing Values (NaN) for the wrongly spelled label. But those days are gone, with thelatest updates it raises an exception.

movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]


---------------------------------------------------------------------------
KeyError
Traceback (most recent call last)
<ipython-input-6-ebe975264840> in <module>
----> 1 movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in
__getitem__
(self, key)
1766
1767 maybe_callable = com.apply_if_callable(
key,self.obj)
-> 1768
return self._getitem_axis(maybe_callable,axis = axis)
1769
1770 def_is_scalar_access(self,key:Tuple):
~\anaconda3\lib\site-packages\pandas\core\indexing.py
in
_getitem_axis
(self, key, axis)
1952 raiseValueError("Cannot index with multidimensional key")
1953
-> 1954 return self._getitem_iterable(key,
axis=axis)
1955
1956 # nested tuple slicing
~\anaconda3\lib\site-packages\pandas\core\indexing.py
in_getitem_iterable(self, key, axis)
1593 else:
1594 # A collection of keys
-> 1595 keyarr,indexer=self._get_listlike_indexer(key,axis,raise_missing=False)
1596 return self.obj._reindex_with_indexers(
1597 {axis:[keyarr,indexer]},copy=True,allow_dups=True
~\anaconda3\lib\site-packages\pandas\core\indexing.py
in
_get_listlike_indexer(self, key, axis, raise_missing)
1550 keyarr,indexer,new_indexer=ax._reindex_non_unique
(keyarr)
1551
-> 1552 self._validate_read_indexer(
1553 keyarr,indexer,o._get_axis_number
(axis),raise_missing=raise_missing
1554 )
~\anaconda3\lib\site-packages\pandas\core\indexing.py
in
_validate_read_indexer
(self, key, indexer, axis, raise_missing)
1652 # just raising
1653 ifnot(ax.is_categorical()orax.is_interval()
)
:
-> 1654 raise KeyError(
1655 "Passing list-likes to .loc or [] with any missing labels "
1656 "is no longer supported, see "

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

One way to take care is by checking the values in the index directly.

"When is Avengers next movie?"in movies.index

Output

False

If you want to ignore the error and move on you can use below approach

movies.query("title in ('Avatar','When is Avengers next Movie?')")


budget vote_average vote_count
title
Avatar 237000000 7.2 11800

Updated on: 10-Nov-2020

559 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements