How to select multiple DataFrame columns using regexp and datatypes

PythonServer Side ProgrammingProgramming

DataFrame maybe compared to a data set held in a spreadsheet or a database with rows and columns. DataFrame is a 2D Object.

Ok, confused with 1D and 2D terminology ?

The major difference between 1D (Series) and 2D (DataFrame) is the number of points of information you need to inorer to arrive at any single data point. If you take an example of a Series, and wanted to extract a value, you only need one point of reference i.e. row index.

In comparsion to a table (DataFrame), one point of reference is not sufficient to get to a data point, you need an intersection of row value and column value.

Below snippet shows how to create a Pandas DataFrame from a csv file.

.read_csv() method by default creates a DataFrame. You can download the movies dataset from kaggle.com by searching for movies.

"""Script : Create a Pandas DataFrame from a csv file."""
import pandas as pd
movies_dataset pd read_csv "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
# 1 print the type of object
type(movies_dataset)
# 2 print the top 5 records in a tabular format
movies_dataset head(5)



budget
id
original_language
original_title
popularity
release_date
revenue
runtime
status
title
vote_average
vote_count
0
237000000
19995
en
Avatar
150.437577
10/12/2009
2787965087
162.0
Released
Avatar
7.2
11800
1
300000000
285
en
Pirates of the Caribbean: At World's End
139.082615
19/05/2007
961000000
169.0
Released
Pirates of the Caribbean: At World's End
6.9
4500
2
245000000
206647
en
Spectre
107.376788
26/10/2015
880674609
148.0
Released
Spectre
6.3
4466
3
250000000
49026
en
The Dark Knight Rises
112.312950
16/07/2012
1084939099
165.0
Released
The Dark Knight Rises
7.6
9106
4
260000000
49529
en
John Carter
43.926995
7/03/2012
284139100
132.0
Released
John Carter
6.1
2124

2. Selecting a single DataFrame column. Passing column name as a string or list to the index operator will return the column values as either Series or DataFrame.

If we pass in a string with the column name, you will get a Series as output, however, passing the list with just one column name will return the DataFrame. We will see this with examples.

# select the data as series movies_dataset["title"]


0 Avatar
1 Pirates of the Caribbean: At World's End
2 Spectre
3 The Dark Knight Rises
4 John Carter
...
4798 El Mariachi
4799 Newlyweds
4800 Signed, Sealed, Delivered
4801 Shanghai Calling
4802 My Date with Drew
Name: title, Length: 4803, dtype: object


# select the data as DataFrame movies_dataset[["title"]]



title
0
Avatar
1
Pirates of the Caribbean: At World's End
2
Spectre
3
The Dark Knight Rises
4
John Carter
...
...
4798
El Mariachi
4799
Newlyweds
4800
Signed, Sealed, Delivered
4801
Shanghai Calling
4802
My Date with Drew

3. Selecting multiple DataFrame column.

# Multiple DataFrame columns movies_dataset[["title""runtime","vote_average","vote_count"]]



title
runtime
vote_average
vote_count
0
Avatar
162.0
7.2
11800
1
Pirates of the Caribbean: At World's End
169.0
6.9
4500
2
Spectre
148.0
6.3
4466
3
The Dark Knight Rises
165.0
7.6
9106
4
John Carter
132.0
6.1
2124
...
...
...
...
...
4798
El Mariachi
81.0
6.6
238
4799
Newlyweds
85.0
5.9
5
4800
Signed, Sealed, Delivered
120.0
7.0
6
4801
Shanghai Calling
98.0
5.7
7
4802
My Date with Drew
90.0
6.3
16

To avoid code readability issues, I always recommend defining a variable for holding the names of the columns as a list and use the column name instead of specifying multiple column names within the code.

columns=["title","runtime","vote_average","vote_count"]movies_dataset[columns]



title
runtime
vote_average
vote_count
0
Avatar
162.0
7.2
11800
1
Pirates of the Caribbean: At World's End
169.0
6.9
4500
2
Spectre
148.0
6.3
4466
3
The Dark Knight Rises
165.0
7.6
9106
4
John Carter
132.0
6.1
2124
...
...
...
...
...
4798
El Mariachi
81.0
6.6
238
4799
Newlyweds
85.0
5.9
5
4800
Signed, Sealed, Delivered
120.0
7.0
6
4801
Shanghai Calling
98.0
5.7
7
4802
My Date with Drew
90.0
6.3
16

4.DataFrame columns by column names.

.filter() method

This method is quite handy to search and select the columns using a string. This works pretty much the same way as the like %% parameter in SQL. Remember, .filter() method selects columns by only inspecting the column names and not the actual data values.

.filter() method supports three parameters that can be used for selection operation.

.like
.regex
.items

like parameter takes a string and attempts to find the column names that contain this string somewhere in the column name.


# Select the column that have a column name like "title" movies_dataset.filter(like="title").head(5)



original_title
title
0
Avatar
Avatar
1
Pirates of the Caribbean: At World's End
Pirates of the Caribbean: At World's End
2
Spectre
Spectre
3
The Dark Knight Rises
The Dark Knight Rises
4
John Carter
John Carter

.regex – More flexible way of selecting the columns names using regualr expressions

# Select the columns that end with "t"
movies_dataset.filter(regex=t).head()



budget
vote_count
0
237000000
11800
1
300000000
4500
2
245000000
4466
3
250000000
9106
4
260000000
2124

.items – Duplicate of passing column name as string or list to index operator, except KeyError will not be raised

DataFrame columns by data types.

The .select_dtypes method works on the column data types if you are interested to filter and use only certain data types.

Again, .select_dtypes method accepts multiple data types (by a list) or single data type (as a String) in its include or exclude parameters and returns a DataFrame with columns of just those given data types.

.include parameter includes the columns with the specified data type/s and .exclude will ignore the columns with the specified data type.

Let’s first look at the data types and count of columns with those data types

movies_dataset=pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv"
movies_dataset.dtypes.value_counts()


object 5
int64 4
float64 3
dtype: int64


a) Filtering integer data types from pandas DataFrames.

movies_dataset select_dtypes(include="int")head(3)


2
1
0


b). Select integer and float data types from pandas DataFrames.

You can specify multiple data types as a list show below.

movies_dataset select_dtypes(include=["int64","float"]).head(3)



budget
id
popularity
revenue
runtime
vote_average
vote_count
0
237000000
19995
150.437577
2787965087
162.0
7.2
11800
1
300000000
285
139.082615
961000000
169.0
6.9
4500
2
245000000
206647
107.376788
880674609
148.0
6.3
4466

c) Well, if you just want all numeric data types, just specify number

movies_dataset select_dtypes(include=["number"]).head(3)



budget
id
popularity
revenue
runtime
vote_average
vote_count
0
237000000
19995
150.437577
2787965087
162.0
7.2
11800
1
300000000
285
139.082615
961000000
169.0
6.9
4500
2
245000000
206647
107.376788
880674609
148.0
6.3
4466

d). Exclude certain data types from pandas DataFrames.


movies_dataset select_dtypes(exclude=["object"]).head(3)



budget
id
popularity
revenue
runtime
vote_average
vote_count
0
237000000
19995
150.437577
2787965087
162.0
7.2
11800
1
300000000
285
139.082615
961000000
169.0
6.9
4500
2
245000000
206647
107.376788
880674609
148.0
6.3
4466


NOTE:- There is not string data type to deal with, pandas convert them to Object, so if you face an exception “TypeError: data type “string” not understood” , replace string with Object instead.

raja
Published on 10-Nov-2020 09:41:03
Advertisements