- Trending Categories
- Data Structure
- Operating System
- C Programming
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to select subsets of data In SQL Query Style in Pandas?
In this post, I will show you how to perform Data Analysis with SQL style filtering with Pandas. Most of the corporate company’s data are stored in databases that require SQL to retrieve and manipulate it. For instance, there are companies like Oracle, IBM, Microsoft having their own databases with their own SQL implementations.
Data scientists have to deal with SQL at some stage of their career as the data is not always stored in CSV files. I personally prefer to use Oracle, as the majority of my company’s data is stored in Oracle.
Scenario – 1 Suppose we are given a task to find all the movies from our movies dataset with below conditions.
- The language of the movies should be either English(en) or Spanish(es).
- The popularity of the movies must be between 500 and 1000.
- The movie’s status must be released.
- The vote count must be greater than 5000. For the above scenario, the SQL statement would look some thing like below.
SELECT FROM WHERE title AS movie_title ,original_language AS movie_language ,popularityAS movie_popularity ,statusAS movie_status ,vote_count AS movie_vote_count movies_data original_languageIN ('en', 'es') AND status=('Released') AND popularitybetween 500 AND 1000 AND vote_count > 5000;
Now that you have seen the SQL for the requirement, let’s do this step by step using pandas. I will show you two methods.
Method 1:- Boolean Indexing
1. Load the movies_data dataset to DataFrame.
import pandas as pd movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv")
Assign a variable for each condition.
languages = [ "en" , "es" ] condition_on_languages = movies . original_language . isin ( languages ) condition_on_status = movies . status == "Released" condition_on_popularity = movies . popularity . between ( 500 , 1000 ) condition_on_votecount = movies . vote_count > 5000
3. Combine all the conditions(boolean arrays) together.
final_conditions = ( condition_on_languages & condition_on_status & condition_on_popularity & condition_on_votecount ) columns = [ "title" , "original_language" , "status" , "popularity" , "vote_count" ] # clubbing all together movies . loc [ final_conditions , columns ]
Method 2:- .query() method.
The .query() method is a SQL where clause style way of filtering the data. The conditions can be passed as a string to this method, however, the column names must not contain any spaces.
If you have spaces in your column names, replace them with underscores using the python replace function.
From my experience I have seen query() method when applied on a larger DataFrame is faster than the previous method.
import pandas as pd movies = pd . read_csv ( "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv" )
4.Build the query string and execute the method.
Note the .query method does not work with triple quoted strings spanning multiple lines.
final_conditions = ( "original_language in ['en','es']" "and status == 'Released' " "and popularity > 500 " "and popularity < 1000" "and vote_count > 5000" ) final_result = movies . query ( final_conditions ) final_result
There is more, often in my coding, I have multiple values to check in my “in” clause. So the above syntax is not ideal to work with. It is possible to reference Python variables using the at symbol (@).
You can also programmatically create the values as a python List and use them with (@).
movie_languages = [ 'en' , 'es' ] final_conditions = ( "original_language in @movie_languages " "and status == 'Released' " "and popularity > 500 " "and popularity < 1000" "and vote_count > 5000" ) final_result = movies . query ( final_conditions ) final_result
- How to select subset of data with Index Labels in Python Pandas?
- How to select a Subset Of Data Using lexicographical slicing in Python Pandas?
- How to select a Subset Of Data Using lexicographical slicingin Python Pandas?
- MySQL query to select all data between range of two dates?
- How to execute SQL update query in a JSP?
- How to use parameterized SQL query in a JSP?
- How to executes an SQL SELECT statement in a JSP?
- MySQL select query to fetch data with null value?
- How to translate SQL data to XML in Oracle?
- Python - Filtering data with Pandas .query() method
- Structured Query Language (SQL)
- How to do recursive SELECT query in MySQL?
- How to use SELECT Query in Android sqlite?
- How to use alias in MySQL select query?
- How to select specific columns in MongoDB query?