How to select multiple DataFrame columns using regexp and datatypes

Python Server Side Programming Programming

DataFrame maybe compared to a data set held in a spreadsheet or a database with rows and columns. DataFrame is a 2D Object.

Ok, confused with 1D and 2D terminology ?

The major difference between 1D (Series) and 2D (DataFrame) is the number of points of information you need to inorer to arrive at any single data point. If you take an example of a Series, and wanted to extract a value, you only need one point of reference i.e. row index.

In comparsion to a table (DataFrame), one point of reference is not sufficient to get to a data point, you need an intersection of row value and column value.

Below snippet shows how to create a Pandas DataFrame from a csv file.

.read_csv() method by default creates a DataFrame. You can download the movies dataset from kaggle.com by searching for movies.

"""Script : Create a Pandas DataFrame from a csv file."""
import pandas as pd
movies_dataset pd read_csv "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
# 1 print the type of object
type(movies_dataset)
# 2 print the top 5 records in a tabular format
movies_dataset head(5)

	budget	id	original_language	original_title	popularity	release_date	revenue	runtime	status	title	vote_average	vote_count
0	237000000	19995	en	Avatar	150.437577	10/12/2009	2787965087	162.0	Released	Avatar	7.2	11800
1	300000000	285	en	Pirates of the Caribbean: At World's End	139.082615	19/05/2007	961000000	169.0	Released	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	206647	en	Spectre	107.376788	26/10/2015	880674609	148.0	Released	Spectre	6.3	4466
3	250000000	49026	en	The Dark Knight Rises	112.312950	16/07/2012	1084939099	165.0	Released	The Dark Knight Rises	7.6	9106
4	260000000	49529	en	John Carter	43.926995	7/03/2012	284139100	132.0	Released	John Carter	6.1	2124

2. Selecting a single DataFrame column. Passing column name as a string or list to the index operator will return the column values as either Series or DataFrame.

If we pass in a string with the column name, you will get a Series as output, however, passing the list with just one column name will return the DataFrame. We will see this with examples.

# select the data as series movies_dataset["title"]

0 Avatar
1 Pirates of the Caribbean: At World's End
2 Spectre
3 The Dark Knight Rises
4 John Carter
...
4798 El Mariachi
4799 Newlyweds
4800 Signed, Sealed, Delivered
4801 Shanghai Calling
4802 My Date with Drew
Name: title, Length: 4803, dtype: object

# select the data as DataFrame movies_dataset[["title"]]

	title
0	Avatar
1	Pirates of the Caribbean: At World's End
2	Spectre
3	The Dark Knight Rises
4	John Carter
...	...
4798	El Mariachi
4799	Newlyweds
4800	Signed, Sealed, Delivered
4801	Shanghai Calling
4802	My Date with Drew

3. Selecting multiple DataFrame column.

# Multiple DataFrame columns movies_dataset[["title""runtime","vote_average","vote_count"]]

	title	runtime	vote_average	vote_count
0	Avatar	162.0	7.2	11800
1	Pirates of the Caribbean: At World's End	169.0	6.9	4500
2	Spectre	148.0	6.3	4466
3	The Dark Knight Rises	165.0	7.6	9106
4	John Carter	132.0	6.1	2124
...	...	...	...	...
4798	El Mariachi	81.0	6.6	238
4799	Newlyweds	85.0	5.9	5
4800	Signed, Sealed, Delivered	120.0	7.0	6
4801	Shanghai Calling	98.0	5.7	7
4802	My Date with Drew	90.0	6.3	16

To avoid code readability issues, I always recommend defining a variable for holding the names of the columns as a list and use the column name instead of specifying multiple column names within the code.

columns=["title","runtime","vote_average","vote_count"]movies_dataset[columns]

	title	runtime	vote_average	vote_count
0	Avatar	162.0	7.2	11800
1	Pirates of the Caribbean: At World's End	169.0	6.9	4500
2	Spectre	148.0	6.3	4466
3	The Dark Knight Rises	165.0	7.6	9106
4	John Carter	132.0	6.1	2124
...	...	...	...	...
4798	El Mariachi	81.0	6.6	238
4799	Newlyweds	85.0	5.9	5
4800	Signed, Sealed, Delivered	120.0	7.0	6
4801	Shanghai Calling	98.0	5.7	7
4802	My Date with Drew	90.0	6.3	16

4.DataFrame columns by column names.

.filter() method

This method is quite handy to search and select the columns using a string. This works pretty much the same way as the like %% parameter in SQL. Remember, .filter() method selects columns by only inspecting the column names and not the actual data values.

.filter() method supports three parameters that can be used for selection operation.

.like
.regex
.items

like parameter takes a string and attempts to find the column names that contain this string somewhere in the column name.

# Select the column that have a column name like "title" movies_dataset.filter(like="title").head(5)

	original_title	title
0	Avatar	Avatar
1	Pirates of the Caribbean: At World's End	Pirates of the Caribbean: At World's End
2	Spectre	Spectre
3	The Dark Knight Rises	The Dark Knight Rises
4	John Carter	John Carter

.regex – More flexible way of selecting the columns names using regualr expressions

# Select the columns that end with "t"
movies_dataset.filter(regex=t).head()

	budget	vote_count
0	237000000	11800
1	300000000	4500
2	245000000	4466
3	250000000	9106
4	260000000	2124

.items – Duplicate of passing column name as string or list to index operator, except KeyError will not be raised

DataFrame columns by data types.

The .select_dtypes method works on the column data types if you are interested to filter and use only certain data types.

Again, .select_dtypes method accepts multiple data types (by a list) or single data type (as a String) in its include or exclude parameters and returns a DataFrame with columns of just those given data types.

.include parameter includes the columns with the specified data type/s and .exclude will ignore the columns with the specified data type.

Let’s first look at the data types and count of columns with those data types

movies_dataset=pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv"
movies_dataset.dtypes.value_counts()

object 5
int64 4
float64 3
dtype: int64

a) Filtering integer data types from pandas DataFrames.

movies_dataset select_dtypes(include="int")head(3)

2
1
0

b). Select integer and float data types from pandas DataFrames.

You can specify multiple data types as a list show below.

movies_dataset select_dtypes(include=["int64","float"]).head(3)

	budget	id	popularity	revenue	runtime	vote_average	vote_count
0	237000000	19995	150.437577	2787965087	162.0	7.2	11800
1	300000000	285	139.082615	961000000	169.0	6.9	4500
2	245000000	206647	107.376788	880674609	148.0	6.3	4466

c) Well, if you just want all numeric data types, just specify number

movies_dataset select_dtypes(include=["number"]).head(3)

	budget	id	popularity	revenue	runtime	vote_average	vote_count
0	237000000	19995	150.437577	2787965087	162.0	7.2	11800
1	300000000	285	139.082615	961000000	169.0	6.9	4500
2	245000000	206647	107.376788	880674609	148.0	6.3	4466

d). Exclude certain data types from pandas DataFrames.

movies_dataset select_dtypes(exclude=["object"]).head(3)

	budget	id	popularity	revenue	runtime	vote_average	vote_count
0	237000000	19995	150.437577	2787965087	162.0	7.2	11800
1	300000000	285	139.082615	961000000	169.0	6.9	4500
2	245000000	206647	107.376788	880674609	148.0	6.3	4466

NOTE:- There is not string data type to deal with, pandas convert them to Object, so if you face an exception “TypeError: data type “string” not understood” , replace string with Object instead.

Kiran P

Updated on: 10-Nov-2020

628 Views

Kickstart Your Career

Get certified by completing the course

Get Started