Python Pandas - Grouping & Aggregation
Python Pandas - GroupBy

Python Pandas - Visualization
Python Pandas - Visualization

Python Pandas - Additional Concepts
Python Pandas - Caveats & Gotchas

Selected Reading

Python Pandas - Indexing and Selecting Data

Quiz

In pandas, indexing and selecting data are crucial for efficiently working with data in Series and DataFrame objects. These operations help you to slice, dice, and access subsets of your data easily.

These operations involve retrieving specific parts of your data structure, whether it's a Series or DataFrame. This process is crucial for data analysis as it allows you to focus on relevant data, apply transformations, and perform calculations.

Indexing in pandas is essential because it provides metadata that helps with analysis, visualization, and interactive display. It automatically aligns data for easier manipulation and simplifies the process of getting and setting data subsets.

This tutorial will explore various methods to slice, dice, and manipulate data using Pandas, helping you understand how to access and modify subsets of your data.

Types of Indexing in Pandas

Similar to Python and NumPy indexing ([ ]) and attribute (.) operators, Pandas provides straightforward methods for accessing data within its data structures. However, because the data type being accessed can be unpredictable, relying exclusively on these standard operators may lead to optimization challenges.

Pandas provides several methods for indexing and selecting data, such as −

Label-Based Indexing with .loc
Integer Position-Based Indexing with .iloc
Indexing with Brackets []

Label-Based Indexing with .loc

The .loc indexer is used for label-based indexing, which means you can access rows and columns by their labels. It also supports boolean arrays for conditional selection.

.loc() has multiple access methods like −

single scalar label: Selects a single row or column, e.g., df.loc['a'].
list of labels: Select multiple rows or columns, e.g., df.loc[['a', 'b']].
Label Slicing: Use slices with labels, e.g., df.loc['a':'f'] (both start and end are included).
Boolean Arrays: Filter data based on conditions, e.g., df.loc[boolean_array].

loc takes two single/list/range operator separated by ','. The first one indicates the row and the second one indicates columns.

Example 1

Here is a basic example that selects all rows for a specific column using the loc indexer.

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

print("Original DataFrame:\n", df)

#select all rows for a specific column
print('\nResult:\n',df.loc[:,'A'])

Its output is as follows −

Original DataFrame:
           A         B         C         D
a  0.962766 -0.195444  1.729083 -0.701897
b -0.552681  0.797465 -1.635212 -0.624931
c  0.581866 -0.404623 -2.124927 -0.190193
d -0.284274  0.019995 -0.589465  0.914940
e  0.697209 -0.629572 -0.347832  0.272185
f -0.181442 -0.000983  2.889981  0.104957
g  1.195847 -1.358104  0.110449 -0.341744
h -0.121682  0.744557  0.083820  0.355442

Result:
 a    0.962766
b   -0.552681
c    0.581866
d   -0.284274
e    0.697209
f   -0.181442
g    1.195847
h   -0.121682
Name: A, dtype: float64

Note: The output generated will vary with each execution because the DataFrame is created using NumPy's random number generator.

Example 2

This example selecting all rows for multiple columns.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple columns, say list[]
print(df.loc[:,['A','C']])

Its output is as follows −

            A           C
a    0.391548    0.745623
b   -0.070649    1.620406
c   -0.317212    1.448365
d   -2.162406   -0.873557
e    2.202797    0.528067
f    0.613709    0.286414
g    1.050559    0.216526
h    1.122680   -1.621420

Example 3

This example selects the specific rows for the specific columns.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns, say list[]
print(df.loc[['a','b','f','h'],['A','C']])

Its output is as follows −

           A          C
a   0.391548   0.745623
b  -0.070649   1.620406
f   0.613709   0.286414
h   1.122680  -1.621420

Example 4

The following example selecting a range of rows for all columns using the loc indexer.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print(df.loc['c':'e'])

Its output is as follows −

          A         B         C         D
c  0.044589  1.966278  0.894157  1.798397
d  0.451744  0.233724 -0.412644 -2.185069
e -0.865967 -1.090676 -0.931936  0.214358

Integer Position-Based Indexing with .iloc

The .iloc indexer is used for integer-based indexing, which allows you to select rows and columns by their numerical position. This method is similar to standard python and numpy indexing (i.e. 0-based indexing).

Single Integer: Selects data by its position, e.g., df.iloc[0].
List of Integers: Select multiple rows or columns by their positions, e.g., df.iloc[[0, 1, 2]].
Integer Slicing: Use slices with integers, e.g., df.iloc[1:3].
Boolean Arrays: Similar to .loc, but for positions.

Example 1

Here is a basic example that selects 4 rows for the all column using the iloc indexer.

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print("Original DataFrame:\n", df)

# select all rows for a specific column
print('\nResult:\n',df.iloc[:4])

Its output is as follows −

Original DataFrame:
           A         B         C         D
0 -1.152267  2.206954 -0.603874  1.275639
1 -0.799114 -0.214075  0.283186  0.030256
2 -1.823776  1.109537  1.512704  0.831070
3 -0.788280  0.961695 -0.127322 -0.597121
4  0.764930 -1.310503  0.108259 -0.600038
5 -1.683649 -0.602324 -1.175043 -0.343795
6  0.323984 -2.314158  0.098935  0.065528
7  0.109998 -0.259021 -0.429467  0.224148

Result:
           A         B         C         D
0 -1.152267  2.206954 -0.603874  1.275639
1 -0.799114 -0.214075  0.283186  0.030256
2 -1.823776  1.109537  1.512704  0.831070
3 -0.788280  0.961695 -0.127322 -0.597121

Example 2

The following example selects the specific data using the integer slicing.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Integer slicing
print(df.iloc[:4])
print(df.iloc[1:5, 2:4])

Its output is as follows −

           A          B           C           D
0   0.699435   0.256239   -1.270702   -0.645195
1  -0.685354   0.890791   -0.813012    0.631615
2  -0.783192  -0.531378    0.025070    0.230806
3   0.539042  -1.284314    0.826977   -0.026251

           C          D
1  -0.813012   0.631615
2   0.025070   0.230806
3   0.826977  -0.026251
4   1.423332   1.130568

Example 3

This example selects the data using the slicing through list of values.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Slicing through list of values
print(df.iloc[[1, 3, 5], [1, 3]])

Its output is as follows −

           B           D
1   0.890791    0.631615
3  -1.284314   -0.026251
5  -0.512888   -0.518930

Direct Indexing with Brackets "[]"

Direct indexing with [] is a quick and intuitive way to access data, similar to indexing with Python dictionaries and NumPy arrays. Its often used for basic operations −

Single Column: Access a single column by its name.
Multiple Columns: Select multiple columns by passing a list of column names.
Row Slicing: Slice rows using integer-based indexing.

Example 1

This example demonstrates how to use the direct indexing with brackets for accessing a single column.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Accessing a Single Column
print(df['A'])

Its output is as follows −

0   -0.850937
1   -1.588211
2   -1.125260
3    2.608681
4   -0.156749
5    0.154958
6    0.396192
7   -0.397918
Name: A, dtype: float64

Example 2

This example selects the multiple columns using the direct indexing.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# Accessing Multiple Columns
print(df[['A', 'B']])

Its output is as follows −

          A         B
0  0.167211 -0.080335
1 -0.104173  1.352168
2 -0.979755 -0.869028
3  0.168335 -1.362229
4 -1.372569  0.360735
5  0.428583 -0.203561
6 -0.119982  1.228681
7 -1.645357  0.331438

Previous Quiz Next