How to select subset of data with Index Labels?

PandasServer Side ProgrammingProgramming

Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. In this post, I will show you how to “Select a Subset Of Data Using Index Labels” using the index label.

Remember, Python dictionaries and lists are built-in data structures that select their data either by using the index label or by index position. A dictionary’s key must be a string, integer, or tuple while a List must either use integers (the position) or slice objects for selection.

Pandas have .loc and.iloc attributes available to perform index operations in their own unique ways. ). With.iloc attribute, pandas select only by position and work similarly to Python lists. The .loc attribute selects only by index label, which is similar to how Python dictionaries work.

Select a Subset Of Data Using Index Labels with .loc[]

The loc and iloc attributes are available on both Series and DataFrame

  • Import the movies dataset with the title as index.

import pandas as pd
movies = pd.read_csv(
   "movies_data.csv",
   index_col="title",
   usecols=["title","budget","vote_average","vote_count"]
)
  • I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if you are dealing with a huge dataset when your index is sorted.

Input

movies.sort_index(inplace = True)
movies.head(3)

Output

titlebudgetvote_averagevote_count
(500) Days of Summer75000007.22904
10 Cloverfield Lane150000006.82468
10 Days in a Madhouse12000004.35

I have sorted the index using sort_index and "inplace = True" parameter.

  • One thing interesting about the syntax of the loc method is that it does not take parenthesis() rather takes square brackets[]. I think (might be wrong) this is because they wanted consistency i.e. you can use [] on a Series to extract rows, while applied on a Dataframe will fetch you the columns.

Input

# extract "Spider-Man 3" ( I'm not a big fan of spidy)
movies.loc["Spider-Man 3"]

Output

budget 258000000.0
vote_average 5.9
vote_count 3576.0
Name: Spider-Man 3, dtype: float64
  • Use a slice to pull out many values. I'm going to pull the movies which I haven't watched. Because this is a string label we are going to get all the data for our search criteria including "Avatar".

Remember − If you work with Python List the last value is excluded but since we are working with strings it's inclusive.

movies.loc["Alien":"Avatar" ]


titlebudgetvote_averagevote_count
Alien110000007.94470
Alien Zone04.03
Alien: Resurrection700000005.91365
Aliens185000007.73220
Aliens in the Attic450000005.3244
---
Australia1300000006.3694
Auto Focus70000006.156
Automata70000005.6670
Autumn in New York650000005.7135
Avatar2370000007.211800

167 rows × 3 columns

  • Can I get any two or more random movies that are not next to each other? Definitely yes, but you need to put more effort into passing a list of movies you need.

What I meant was you need to have square brackets with in a square bracket.

Input

movies.loc[["Avatar", "Avengers: Age of Ultron"]]


titlebudgetvote_averagevote_count
Avatar2370000007.211800
Avengers: Age of Ultron2800000007.36767
  • Can I change the order of selection? Of course, you can help yourself by specifying the list of labels you need in an order.

While this looks cool to specify the list of labels you want to extract, do you know what happens if you spelled a value wrongly? Pandas would have stuck missing Values (NaN) for the wrongly spelled label. But those days are gone, with the latest updates it raises an exception.

Input

movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-ebe975264840> in <module>
----> 1 movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]

∽\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1766
1767 maybe_callable = com.apply_if_callable(key, self.obj)
−> 1768 return self._getitem_axis(maybe_callable, axis=axis)
1769
1770 def _is_scalar_access(self, key: Tuple):

∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1952 raise ValueError("Cannot index with multidimensional key")
1953
−> 1954 return self._getitem_iterable(key, axis=axis)
1955
1956 # nested tuple slicing

∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1593 else:
1594 # A collection of keys
−> 1595 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1596 return self.obj._reindex_with_indexers(
1597 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1550 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1551
−> 1552 self._validate_read_indexer(
1553 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1554 )

∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1652 # just raising
1653 if not (ax.is_categorical() or ax.is_interval()):
−> 1654 raise KeyError(
1655 "Passing list−likes to .loc or [] with any missing labels "
1656 "is no longer supported, see "

KeyError: 'Passing list−likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas−docs/stable/user_guide/indexing.html#deprecate−loc−reindex−listlike'

One way to take care is by checking the values in the index directly.

Input

"When is Avengers next movie?" in movies.index

If you want to ignore the error and move on you can use below approach

movies.query("title in ('Avatar','When is Avengers next Movie?')")


titlebudgetvote_averagevote_count
Avatar2370000007.211800
raja
Updated on 23-Oct-2020 13:43:39

Advertisements