How to select a Subset Of Data Using lexicographical slicing in Python Pandas?

PythonServer Side ProgrammingProgrammingPandas

Introduction

Pandas have a dual selection capability to select the subset of data using the Index position or by using the Index labels. Inthis post, I will show you how to "Select a Subset Of Data Using lexicographical slicing".

Google is full of datasets. Search for movies dataset in kaggle.com. This post uses the movies data set from kaggle.

How to do it

  • Import the movies dataset with only the columns required for this example.

import pandas as pd
import numpy as np
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",index_col="title",
usecols=["title","budget","vote_average","vote_count"])
movies.sample(n=5)

budgetvote_averagevote_count
titile


Little Voice06.661
Grown Ups 2800000005.81155
The Best Years of Our Lives21000007.6143
Tusk28000005.1366
Operation Chromite05.829
  • I always recommend sorting the index, especially if the index is made up of strings. You will notice the difference if you aredealing with a huge dataset when your index is sorted.

What if I don't sort the index?

No problem your code is going to run forever. Just kidding, well if the index labels are unsorted then pandas have to traversethrough all the labels one by one to match your query. Just imagine an Oxford dictionary without an index page, what you aregoing to do? With the index sorted you can jump around quickly to a label you want to extract, so is the case with Pandastoo.

Let us check first if our index is sorted or not.

# check if the index is sorted or not ?
movies.index.is_monotonic

False

  • Clearly, the index is un sorted. We will try to select the movies starting with A%. This is like writing

select * from movies where title like'A%'

movies.loc["Aa":"Bb"]
select * from movies where title like 'A%'



---------------------------------------------------------------------------
ValueErrorTraceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind)
4844try:
-> 4845return self._searchsorted_monotonic(label, side) 4846except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(se lf, label, side)
4805
-> 4806raise ValueError("index must be monotonic increasing or decreasing")
4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyErrorTraceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in getitem (self, key)
1766
1767maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768return self._getitem_axis(maybe_callable, axis=axis) 1769
1770def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1910if isinstance(key, slice):
1911self._validate_key(key, axis)
-> 1912return self._get_slice_axis(key, axis=axis) 1913elif com.is_bool_indexer(key):
1914return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_ob j, axis)
1794
1795labels = obj._get_axis(axis)
-> 1796indexer = labels.slice_indexer(
1797slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name 1798)

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
4711slice(1, 3)
4712"""
-> 4713start_slice, end_slice = self.slice_locs(start, end, step=step, kind=ki nd)
4714
4715# return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, en d, step, kind)
4924start_slice = None
4925if start is not None:
-> 4926start_slice = self.get_slice_bound(start, "left", kind) 4927if start_slice is None:
4928start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind)
4846except ValueError:
4847# raise the original KeyError
-> 4848raise err
4849
4850if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, labe l, side, kind)
4840# we need to look up the label
4841try:
-> 4842slc = self.get_loc(label) 4843except KeyError as err:
4844try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method,

tolerance)
2646return self._engine.get_loc(key)
2647except KeyError:
-> 2648return self._engine.get_loc(self._maybe_cast_indexer(key))
2649indexer = self.get_indexer([key], method=method, tolerance=tolerance) 2650if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer() KeyError: 'Aa'

  • Sort the index in ascending order and try the same command to take the advantage of sorting for lexicographical slicing.

True


  • Now our data is set and ready for lexicographical slicing. Let us now select all the movie titles starting with letter A till letter B.


budgetvote_averagevote_count
title


Abandon250000004.645
Abandoned05.827
Abduction350000005.6961
Aberdeen07.06
About Last Night125000006.0210
............
Battle for the Planet of the Apes17000005.5215
Battle of the Year200000005.988
Battle: Los Angeles700000005.51448
Battlefield Earth440000003.0255
Battleship2090000005.52114

292 rows × 3 columns

True

titlebudgetvote_averagevote_count
Æon Flux620000005.4703
xXx: State of the Union600000004.7549
xXx700000005.81424
eXistenZ150000006.7475
[REC]²56000006.4489

budget vote_average vote_count title 

This is a no brainer to see the empty DataFrame as the data is sorted in reverse order. Let us reverse the letters and run this again.

titlebudgetvote_averagevote_count
B-Girl05.57
Ayurveda: Art of Being3000005.53
Away We Go170000006.7189
Awake860000006.3395
Avengers: Age of Ultron2800000007.36767
............
About Last Night125000006.0210
Aberdeen07.06
Abduction350000005.6961
Abandoned05.827
Abandon250000004.645

228 rows × 3 columns

raja
Published on 05-Nov-2020 12:45:57
Advertisements