- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
Physics
Chemistry
Biology
Mathematics
English
Economics
Psychology
Social Studies
Fashion Studies
Legal Studies
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to select subset of data with Index Labels?
Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. In this post, I will show you how to “Select a Subset Of Data Using Index Labels” using the index label.
Remember, Python dictionaries and lists are built-in data structures that select their data either by using the index label or by index position. A dictionary’s key must be a string, integer, or tuple while a List must either use integers (the position) or slice objects for selection.
Pandas have .loc and.iloc attributes available to perform index operations in their own unique ways. ). With.iloc attribute, pandas select only by position and work similarly to Python lists. The .loc attribute selects only by index label, which is similar to how Python dictionaries work.
Select a Subset Of Data Using Index Labels with .loc[]
The loc and iloc attributes are available on both Series and DataFrame
Import the movies dataset with the title as index.
import pandas as pd movies = pd.read_csv( "movies_data.csv", index_col="title", usecols=["title","budget","vote_average","vote_count"] )
I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if you are dealing with a huge dataset when your index is sorted.
Input
movies.sort_index(inplace = True) movies.head(3)
Output
title | budget | vote_average | vote_count |
---|---|---|---|
(500) Days of Summer | 7500000 | 7.2 | 2904 |
10 Cloverfield Lane | 15000000 | 6.8 | 2468 |
10 Days in a Madhouse | 1200000 | 4.3 | 5 |
I have sorted the index using sort_index and "inplace = True" parameter.
One thing interesting about the syntax of the loc method is that it does not take parenthesis() rather takes square brackets[]. I think (might be wrong) this is because they wanted consistency i.e. you can use [] on a Series to extract rows, while applied on a Dataframe will fetch you the columns.
Input
# extract "Spider-Man 3" ( I'm not a big fan of spidy) movies.loc["Spider-Man 3"]
Output
budget 258000000.0 vote_average 5.9 vote_count 3576.0 Name: Spider-Man 3, dtype: float64
Use a slice to pull out many values. I'm going to pull the movies which I haven't watched. Because this is a string label we are going to get all the data for our search criteria including "Avatar".
Remember − If you work with Python List the last value is excluded but since we are working with strings it's inclusive.
movies.loc["Alien":"Avatar" ]
title | budget | vote_average | vote_count |
---|---|---|---|
Alien | 11000000 | 7.9 | 4470 |
Alien Zone | 0 | 4.0 | 3 |
Alien: Resurrection | 70000000 | 5.9 | 1365 |
Aliens | 18500000 | 7.7 | 3220 |
Aliens in the Attic | 45000000 | 5.3 | 244 |
- | - | - | |
Australia | 130000000 | 6.3 | 694 |
Auto Focus | 7000000 | 6.1 | 56 |
Automata | 7000000 | 5.6 | 670 |
Autumn in New York | 65000000 | 5.7 | 135 |
Avatar | 237000000 | 7.2 | 11800 |
167 rows × 3 columns
Can I get any two or more random movies that are not next to each other? Definitely yes, but you need to put more effort into passing a list of movies you need.
What I meant was you need to have square brackets with in a square bracket.
Input
movies.loc[["Avatar", "Avengers: Age of Ultron"]]
title | budget | vote_average | vote_count |
---|---|---|---|
Avatar | 237000000 | 7.2 | 11800 |
Avengers: Age of Ultron | 280000000 | 7.3 | 6767 |
Can I change the order of selection? Of course, you can help yourself by specifying the list of labels you need in an order.
While this looks cool to specify the list of labels you want to extract, do you know what happens if you spelled a value wrongly? Pandas would have stuck missing Values (NaN) for the wrongly spelled label. But those days are gone, with the latest updates it raises an exception.
Input
movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-21-ebe975264840> in <module> ----> 1 movies.loc[["Avengers: Age of Ultron","Avatar","When is Avengers next movie?"]] ∽\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key) 1766 1767 maybe_callable = com.apply_if_callable(key, self.obj) −> 1768 return self._getitem_axis(maybe_callable, axis=axis) 1769 1770 def _is_scalar_access(self, key: Tuple): ∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _getitem_axis(self, key, axis) 1952 raise ValueError("Cannot index with multidimensional key") 1953 −> 1954 return self._getitem_iterable(key, axis=axis) 1955 1956 # nested tuple slicing ∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis) 1593 else: 1594 # A collection of keys −> 1595 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False) 1596 return self.obj._reindex_with_indexers( 1597 {axis: [keyarr, indexer]}, copy=True, allow_dups=True ∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing) 1550 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr) 1551 −> 1552 self._validate_read_indexer( 1553 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing 1554 ) ∽\anaconda3\lib\site−packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing) 1652 # just raising 1653 if not (ax.is_categorical() or ax.is_interval()): −> 1654 raise KeyError( 1655 "Passing list−likes to .loc or [] with any missing labels " 1656 "is no longer supported, see " KeyError: 'Passing list−likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas−docs/stable/user_guide/indexing.html#deprecate−loc−reindex−listlike'
One way to take care is by checking the values in the index directly.
Input
"When is Avengers next movie?" in movies.index
If you want to ignore the error and move on you can use below approach
movies.query("title in ('Avatar','When is Avengers next Movie?')")
title | budget | vote_average | vote_count |
---|---|---|---|
Avatar | 237000000 | 7.2 | 11800 |