Article Categories

Selected Reading

How to select multiple DataFrame columns using regexp and datatypes

Python Server Side Programming Programming

A DataFrame is a two-dimensional tabular data structure with rows and columns, similar to a spreadsheet or database table. Selecting specific columns efficiently is crucial for data analysis. This article demonstrates how to select DataFrame columns using regular expressions and data types.

Creating a Sample DataFrame

Let's start by creating a DataFrame from a movies dataset ?

import pandas as pd

# Create DataFrame from CSV
movies_dataset = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")

# Display basic info
print(type(movies_dataset))
print(movies_dataset.head())

<class 'pandas.core.frame.DataFrame'>
   budget     id original_language original_title  popularity release_date  \
0  237000000  19995                en         Avatar  150.437577   10/12/2009   
1  300000000    285                en  Pirates of...  139.082615   19/05/2007   
2  245000000 206647                en        Spectre  107.376788   26/10/2015   
3  250000000  49026                en  The Dark K...  112.312950   16/07/2012   
4  260000000  49529                en    John Carter   43.926995    7/03/2012   

       revenue  runtime    status             title  vote_average  vote_count  
0  2787965087    162.0  Released            Avatar           7.2       11800  
1   961000000    169.0  Released  Pirates of the...           6.9        4500  
2   880674609    148.0  Released           Spectre           6.3        4466  
3  1084939099    165.0  Released  The Dark Knight...           7.6        9106  
4   284139100    132.0  Released       John Carter           6.1        2124

Basic Column Selection

Single Column Selection

Use string for Series output or list for DataFrame output ?

# Returns a Series
title_series = movies_dataset["title"]
print(type(title_series))
print(title_series.head())

<class 'pandas.core.series.Series'>
0                           Avatar
1    Pirates of the Caribbean: At World's End
2                          Spectre
3              The Dark Knight Rises
4                      John Carter
Name: title, dtype: object

# Returns a DataFrame
title_df = movies_dataset[["title"]]
print(type(title_df))
print(title_df.head())

<class 'pandas.core.frame.DataFrame'>
                                   title
0                             Avatar
1  Pirates of the Caribbean: At World's End
2                            Spectre
3                The Dark Knight Rises
4                        John Carter

Multiple Column Selection

Select multiple columns by passing a list of column names ?

# Select multiple columns
columns = ["title", "runtime", "vote_average", "vote_count"]
selected_data = movies_dataset[columns]
print(selected_data.head())

                                   title  runtime  vote_average  vote_count
0                             Avatar    162.0           7.2       11800
1  Pirates of the Caribbean: At World's End    169.0           6.9        4500
2                            Spectre    148.0           6.3        4466
3                The Dark Knight Rises    165.0           7.6        9106
4                        John Carter    132.0           6.1        2124

Using filter() Method for Pattern Matching

The filter() method selects columns based on their names using patterns. It supports three parameters: like, regex, and items.

Using 'like' Parameter

Find columns containing a specific substring ?

# Select columns containing "title"
title_columns = movies_dataset.filter(like="title")
print(title_columns.head())

  original_title                       title
0         Avatar                      Avatar
1  Pirates of...  Pirates of the Caribbean: At World's End
2        Spectre                     Spectre
3  The Dark K...            The Dark Knight Rises
4    John Carter                 John Carter

Using 'regex' Parameter

Use regular expressions for more flexible column selection ?

# Select columns ending with "t"
ending_with_t = movies_dataset.filter(regex="t$")
print(ending_with_t.head())

      budget  vote_count
0  237000000       11800
1  300000000        4500
2  245000000        4466
3  250000000        9106
4  260000000        2124

# Select columns starting with "vote"
vote_columns = movies_dataset.filter(regex="^vote")
print(vote_columns.head())

   vote_average  vote_count
0           7.2       11800
1           6.9        4500
2           6.3        4466
3           7.6        9106
4           6.1        2124

Selecting Columns by Data Types

The select_dtypes() method filters columns based on their data types using include or exclude parameters.

Check Data Types

# Check data types distribution
print(movies_dataset.dtypes.value_counts())

object     5
int64      4
float64    3
dtype: int64

Select Numeric Columns

# Select all numeric columns
numeric_data = movies_dataset.select_dtypes(include=["number"])
print(numeric_data.head())

      budget     id  popularity     revenue  runtime  vote_average  vote_count
0  237000000  19995  150.437577  2787965087    162.0           7.2       11800
1  300000000    285  139.082615   961000000    169.0           6.9        4500
2  245000000 206647  107.376788   880674609    148.0           6.3        4466
3  250000000  49026  112.312950  1084939099    165.0           7.6        9106
4  260000000  49529   43.926995   284139100    132.0           6.1        2124

Select Specific Data Types

# Select only integer columns
int_columns = movies_dataset.select_dtypes(include=["int64"])
print(int_columns.head())

      budget     id     revenue  vote_count
0  237000000  19995  2787965087       11800
1  300000000    285   961000000        4500
2  245000000 206647   880674609        4466
3  250000000  49026  1084939099        9106
4  260000000  49529   284139100        2124

Exclude Data Types

# Exclude object columns (text data)
non_object_data = movies_dataset.select_dtypes(exclude=["object"])
print(non_object_data.head())

      budget     id  popularity     revenue  runtime  vote_average  vote_count
0  237000000  19995  150.437577  2787965087    162.0           7.2       11800
1  300000000    285  139.082615   961000000    169.0           6.9        4500
2  245000000 206647  107.376788   880674609    148.0           6.3        4466
3  250000000  49026  112.312950  1084939099    165.0           7.6        9106
4  260000000  49529   43.926995   284139100    132.0           6.1        2124

Comparison of Selection Methods

Method	Use Case	Example
Column names	Exact column selection	`df[["col1", "col2"]]`
`filter(like=)`	Partial name matching	`df.filter(like="vote")`
`filter(regex=)`	Pattern-based selection	`df.filter(regex="^vote")`
`select_dtypes()`	Data type-based selection	`df.select_dtypes(include=["number"])`

Conclusion

Use filter() with regex patterns for flexible column name matching, and select_dtypes() for data type-based selection. These methods provide powerful ways to subset DataFrames efficiently.

Kiran P

Updated on: 2026-03-25T12:18:44+05:30

908 Views

Previous Next