Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to select multiple DataFrame columns using regexp and datatypes
A DataFrame is a two-dimensional tabular data structure with rows and columns, similar to a spreadsheet or database table. Selecting specific columns efficiently is crucial for data analysis. This article demonstrates how to select DataFrame columns using regular expressions and data types.
Creating a Sample DataFrame
Let's start by creating a DataFrame from a movies dataset ?
import pandas as pd
# Create DataFrame from CSV
movies_dataset = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
# Display basic info
print(type(movies_dataset))
print(movies_dataset.head())
<class 'pandas.core.frame.DataFrame'>
budget id original_language original_title popularity release_date \
0 237000000 19995 en Avatar 150.437577 10/12/2009
1 300000000 285 en Pirates of... 139.082615 19/05/2007
2 245000000 206647 en Spectre 107.376788 26/10/2015
3 250000000 49026 en The Dark K... 112.312950 16/07/2012
4 260000000 49529 en John Carter 43.926995 7/03/2012
revenue runtime status title vote_average vote_count
0 2787965087 162.0 Released Avatar 7.2 11800
1 961000000 169.0 Released Pirates of the... 6.9 4500
2 880674609 148.0 Released Spectre 6.3 4466
3 1084939099 165.0 Released The Dark Knight... 7.6 9106
4 284139100 132.0 Released John Carter 6.1 2124
Basic Column Selection
Single Column Selection
Use string for Series output or list for DataFrame output ?
# Returns a Series title_series = movies_dataset["title"] print(type(title_series)) print(title_series.head())
<class 'pandas.core.series.Series'> 0 Avatar 1 Pirates of the Caribbean: At World's End 2 Spectre 3 The Dark Knight Rises 4 John Carter Name: title, dtype: object
# Returns a DataFrame title_df = movies_dataset[["title"]] print(type(title_df)) print(title_df.head())
<class 'pandas.core.frame.DataFrame'>
title
0 Avatar
1 Pirates of the Caribbean: At World's End
2 Spectre
3 The Dark Knight Rises
4 John Carter
Multiple Column Selection
Select multiple columns by passing a list of column names ?
# Select multiple columns columns = ["title", "runtime", "vote_average", "vote_count"] selected_data = movies_dataset[columns] print(selected_data.head())
title runtime vote_average vote_count
0 Avatar 162.0 7.2 11800
1 Pirates of the Caribbean: At World's End 169.0 6.9 4500
2 Spectre 148.0 6.3 4466
3 The Dark Knight Rises 165.0 7.6 9106
4 John Carter 132.0 6.1 2124
Using filter() Method for Pattern Matching
The filter() method selects columns based on their names using patterns. It supports three parameters: like, regex, and items.
Using 'like' Parameter
Find columns containing a specific substring ?
# Select columns containing "title" title_columns = movies_dataset.filter(like="title") print(title_columns.head())
original_title title 0 Avatar Avatar 1 Pirates of... Pirates of the Caribbean: At World's End 2 Spectre Spectre 3 The Dark K... The Dark Knight Rises 4 John Carter John Carter
Using 'regex' Parameter
Use regular expressions for more flexible column selection ?
# Select columns ending with "t" ending_with_t = movies_dataset.filter(regex="t$") print(ending_with_t.head())
budget vote_count
0 237000000 11800
1 300000000 4500
2 245000000 4466
3 250000000 9106
4 260000000 2124
# Select columns starting with "vote" vote_columns = movies_dataset.filter(regex="^vote") print(vote_columns.head())
vote_average vote_count 0 7.2 11800 1 6.9 4500 2 6.3 4466 3 7.6 9106 4 6.1 2124
Selecting Columns by Data Types
The select_dtypes() method filters columns based on their data types using include or exclude parameters.
Check Data Types
# Check data types distribution print(movies_dataset.dtypes.value_counts())
object 5 int64 4 float64 3 dtype: int64
Select Numeric Columns
# Select all numeric columns numeric_data = movies_dataset.select_dtypes(include=["number"]) print(numeric_data.head())
budget id popularity revenue runtime vote_average vote_count
0 237000000 19995 150.437577 2787965087 162.0 7.2 11800
1 300000000 285 139.082615 961000000 169.0 6.9 4500
2 245000000 206647 107.376788 880674609 148.0 6.3 4466
3 250000000 49026 112.312950 1084939099 165.0 7.6 9106
4 260000000 49529 43.926995 284139100 132.0 6.1 2124
Select Specific Data Types
# Select only integer columns int_columns = movies_dataset.select_dtypes(include=["int64"]) print(int_columns.head())
budget id revenue vote_count
0 237000000 19995 2787965087 11800
1 300000000 285 961000000 4500
2 245000000 206647 880674609 4466
3 250000000 49026 1084939099 9106
4 260000000 49529 284139100 2124
Exclude Data Types
# Exclude object columns (text data) non_object_data = movies_dataset.select_dtypes(exclude=["object"]) print(non_object_data.head())
budget id popularity revenue runtime vote_average vote_count
0 237000000 19995 150.437577 2787965087 162.0 7.2 11800
1 300000000 285 139.082615 961000000 169.0 6.9 4500
2 245000000 206647 107.376788 880674609 148.0 6.3 4466
3 250000000 49026 112.312950 1084939099 165.0 7.6 9106
4 260000000 49529 43.926995 284139100 132.0 6.1 2124
Comparison of Selection Methods
| Method | Use Case | Example |
|---|---|---|
| Column names | Exact column selection | df[["col1", "col2"]] |
filter(like=) |
Partial name matching | df.filter(like="vote") |
filter(regex=) |
Pattern-based selection | df.filter(regex="^vote") |
select_dtypes() |
Data type-based selection | df.select_dtypes(include=["number"]) |
Conclusion
Use filter() with regex patterns for flexible column name matching, and select_dtypes() for data type-based selection. These methods provide powerful ways to subset DataFrames efficiently.
