
- Python Basic Tutorial
- Python - Home
- Python - Overview
- Python - Environment Setup
- Python - Basic Syntax
- Python - Comments
- Python - Variables
- Python - Data Types
- Python - Operators
- Python - Decision Making
- Python - Loops
- Python - Numbers
- Python - Strings
- Python - Lists
- Python - Tuples
- Python - Dictionary
- Python - Date & Time
- Python - Functions
- Python - Modules
- Python - Files I/O
- Python - Exceptions
How to select multiple DataFrame columns using regexp and datatypes
DataFrame maybe compared to a data set held in a spreadsheet or a database with rows and columns. DataFrame is a 2D Object.
Ok, confused with 1D and 2D terminology ?
The major difference between 1D (Series) and 2D (DataFrame) is the number of points of information you need to inorer to arrive at any single data point. If you take an example of a Series, and wanted to extract a value, you only need one point of reference i.e. row index.
In comparsion to a table (DataFrame), one point of reference is not sufficient to get to a data point, you need an intersection of row value and column value.
Below snippet shows how to create a Pandas DataFrame from a csv file.
.read_csv() method by default creates a DataFrame. You can download the movies dataset from kaggle.com by searching for movies.
"""Script : Create a Pandas DataFrame from a csv file.""" import pandas as pd movies_dataset pd read_csv "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv") # 1 print the type of object type(movies_dataset) # 2 print the top 5 records in a tabular format movies_dataset head(5)
budget | id | original_language | original_title | popularity | release_date | revenue | runtime | status | title | vote_average | vote_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 237000000 | 19995 | en | Avatar | 150.437577 | 10/12/2009 | 2787965087 | 162.0 | Released | Avatar | 7.2 | 11800 |
1 | 300000000 | 285 | en | Pirates of the Caribbean: At World's End | 139.082615 | 19/05/2007 | 961000000 | 169.0 | Released | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
2 | 245000000 | 206647 | en | Spectre | 107.376788 | 26/10/2015 | 880674609 | 148.0 | Released | Spectre | 6.3 | 4466 |
3 | 250000000 | 49026 | en | The Dark Knight Rises | 112.312950 | 16/07/2012 | 1084939099 | 165.0 | Released | The Dark Knight Rises | 7.6 | 9106 |
4 | 260000000 | 49529 | en | John Carter | 43.926995 | 7/03/2012 | 284139100 | 132.0 | Released | John Carter | 6.1 | 2124 |
2. Selecting a single DataFrame column. Passing column name as a string or list to the index operator will return the column values as either Series or DataFrame.
If we pass in a string with the column name, you will get a Series as output, however, passing the list with just one column name will return the DataFrame. We will see this with examples.
# select the data as series movies_dataset["title"]
0 Avatar 1 Pirates of the Caribbean: At World's End 2 Spectre 3 The Dark Knight Rises 4 John Carter ... 4798 El Mariachi 4799 Newlyweds 4800 Signed, Sealed, Delivered 4801 Shanghai Calling 4802 My Date with Drew Name: title, Length: 4803, dtype: object
# select the data as DataFrame movies_dataset[["title"]]
title | |
---|---|
0 | Avatar |
1 | Pirates of the Caribbean: At World's End |
2 | Spectre |
3 | The Dark Knight Rises |
4 | John Carter |
... | ... |
4798 | El Mariachi |
4799 | Newlyweds |
4800 | Signed, Sealed, Delivered |
4801 | Shanghai Calling |
4802 | My Date with Drew |
3. Selecting multiple DataFrame column.
# Multiple DataFrame columns movies_dataset[["title""runtime","vote_average","vote_count"]]
title | runtime | vote_average | vote_count | |
---|---|---|---|---|
0 | Avatar | 162.0 | 7.2 | 11800 |
1 | Pirates of the Caribbean: At World's End | 169.0 | 6.9 | 4500 |
2 | Spectre | 148.0 | 6.3 | 4466 |
3 | The Dark Knight Rises | 165.0 | 7.6 | 9106 |
4 | John Carter | 132.0 | 6.1 | 2124 |
... | ... | ... | ... | ... |
4798 | El Mariachi | 81.0 | 6.6 | 238 |
4799 | Newlyweds | 85.0 | 5.9 | 5 |
4800 | Signed, Sealed, Delivered | 120.0 | 7.0 | 6 |
4801 | Shanghai Calling | 98.0 | 5.7 | 7 |
4802 | My Date with Drew | 90.0 | 6.3 | 16 |
To avoid code readability issues, I always recommend defining a variable for holding the names of the columns as a list and use the column name instead of specifying multiple column names within the code.
columns=["title","runtime","vote_average","vote_count"]movies_dataset[columns]
title | runtime | vote_average | vote_count | |
---|---|---|---|---|
0 | Avatar | 162.0 | 7.2 | 11800 |
1 | Pirates of the Caribbean: At World's End | 169.0 | 6.9 | 4500 |
2 | Spectre | 148.0 | 6.3 | 4466 |
3 | The Dark Knight Rises | 165.0 | 7.6 | 9106 |
4 | John Carter | 132.0 | 6.1 | 2124 |
... | ... | ... | ... | ... |
4798 | El Mariachi | 81.0 | 6.6 | 238 |
4799 | Newlyweds | 85.0 | 5.9 | 5 |
4800 | Signed, Sealed, Delivered | 120.0 | 7.0 | 6 |
4801 | Shanghai Calling | 98.0 | 5.7 | 7 |
4802 | My Date with Drew | 90.0 | 6.3 | 16 |
4.DataFrame columns by column names.
.filter() method
This method is quite handy to search and select the columns using a string. This works pretty much the same way as the like %% parameter in SQL. Remember, .filter() method selects columns by only inspecting the column names and not the actual data values.
.filter() method supports three parameters that can be used for selection operation.
.like
.regex
.items
like parameter takes a string and attempts to find the column names that contain this string somewhere in the column name.
# Select the column that have a column name like "title" movies_dataset.filter(like="title").head(5)
original_title | title | |
---|---|---|
0 | Avatar | Avatar |
1 | Pirates of the Caribbean: At World's End | Pirates of the Caribbean: At World's End |
2 | Spectre | Spectre |
3 | The Dark Knight Rises | The Dark Knight Rises |
4 | John Carter | John Carter |
.regex – More flexible way of selecting the columns names using regualr expressions
# Select the columns that end with "t" movies_dataset.filter(regex=t).head()
budget | vote_count | |
---|---|---|
0 | 237000000 | 11800 |
1 | 300000000 | 4500 |
2 | 245000000 | 4466 |
3 | 250000000 | 9106 |
4 | 260000000 | 2124 |
.items – Duplicate of passing column name as string or list to index operator, except KeyError will not be raised
DataFrame columns by data types.
The .select_dtypes method works on the column data types if you are interested to filter and use only certain data types.
Again, .select_dtypes method accepts multiple data types (by a list) or single data type (as a String) in its include or exclude parameters and returns a DataFrame with columns of just those given data types.
.include parameter includes the columns with the specified data type/s and .exclude will ignore the columns with the specified data type.
Let’s first look at the data types and count of columns with those data types
movies_dataset=pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" movies_dataset.dtypes.value_counts()
object 5 int64 4 float64 3 dtype: int64
a) Filtering integer data types from pandas DataFrames.
movies_dataset select_dtypes(include="int")head(3)
2 |
---|
1 |
0 |
b). Select integer and float data types from pandas DataFrames.
You can specify multiple data types as a list show below.
movies_dataset select_dtypes(include=["int64","float"]).head(3)
budget | id | popularity | revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|---|---|---|
0 | 237000000 | 19995 | 150.437577 | 2787965087 | 162.0 | 7.2 | 11800 |
1 | 300000000 | 285 | 139.082615 | 961000000 | 169.0 | 6.9 | 4500 |
2 | 245000000 | 206647 | 107.376788 | 880674609 | 148.0 | 6.3 | 4466 |
c) Well, if you just want all numeric data types, just specify number
movies_dataset select_dtypes(include=["number"]).head(3)
budget | id | popularity | revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|---|---|---|
0 | 237000000 | 19995 | 150.437577 | 2787965087 | 162.0 | 7.2 | 11800 |
1 | 300000000 | 285 | 139.082615 | 961000000 | 169.0 | 6.9 | 4500 |
2 | 245000000 | 206647 | 107.376788 | 880674609 | 148.0 | 6.3 | 4466 |
d). Exclude certain data types from pandas DataFrames.
movies_dataset select_dtypes(exclude=["object"]).head(3)
budget | id | popularity | revenue | runtime | vote_average | vote_count | |
---|---|---|---|---|---|---|---|
0 | 237000000 | 19995 | 150.437577 | 2787965087 | 162.0 | 7.2 | 11800 |
1 | 300000000 | 285 | 139.082615 | 961000000 | 169.0 | 6.9 | 4500 |
2 | 245000000 | 206647 | 107.376788 | 880674609 | 148.0 | 6.3 | 4466 |
NOTE:- There is not string data type to deal with, pandas convert them to Object, so if you face an exception “TypeError: data type “string” not understood” , replace string with Object instead.
- Related Articles
- Select multiple columns in a Pandas DataFrame
- Python - Select multiple columns from a Pandas dataframe
- Python - Select columns with specific datatypes
- Plot multiple columns of Pandas DataFrame using Seaborn
- How to sort multiple columns of a Pandas DataFrame?
- Python Pandas - How to select multiple rows from a DataFrame
- How to select all columns except one in a Pandas DataFrame?
- How to add multiple text labels from DataFrame columns in Python Plotly?
- MySQL Select Statement DISTINCT for Multiple Columns?
- Python Pandas - Plot multiple data columns in a DataFrame?
- Python – Get the datatypes of columns
- Select multiple columns and display in a single column in MySQL?
- Select multiple sums with MySQL query and display them in separate columns?
- Plot multiple columns of Pandas dataframe on the bar chart in Matplotlib
- How to AutoSum Multiple Rows and Columns in Excel?
