Drop rows in PySpark DataFrame with condition
Applying conditions to a data frame is very useful for a programmer. We can validate data to make sure it fits our model, and we can filter out irrelevant rows, which simplifies analysis and improves data visualization. In this article, we will apply conditions to a PySpark data frame and drop rows from it. PySpark is the Python API for Apache Spark; it allows a programmer to use Spark's real-time, distributed data processing from a local Python environment.
Example
Now that we have acquired a basic understanding of what PySpark data frames are, let's create one. Here,
- We imported "SparkSession" from "pyspark.sql". This session acts as an entry point to the Spark APIs and allows us to configure the data frame as we wish. We create only one SparkSession per application and use it throughout the codebase.
- We built a dictionary dataset that contains information about dishes, including their names, prices, ratings and discounts, and used it to create a pandas data frame.
- The pandas data frame is then passed as an argument to the "createDataFrame()" method to create a Spark data frame.
- At last, we displayed the data frame using the "show()" method.
from pyspark.sql import SparkSession
import pandas as pd

sparkOBJ = SparkSession.builder.appName("DscSpark").getOrCreate()
dataset = {"Dish name": ["Biryani", "Fish", "Mashed potatoes", "Salad"],
           "Price": [250, 350, 180, 200],
           "Rating": [9, 7, 8, 6],
           "discount %": [20, 30, 10, 15]}
dataframe_pd = pd.DataFrame(dataset, index=[1, 2, 3, 4])
dataframe_spk = sparkOBJ.createDataFrame(dataframe_pd)
print("The original data frame is: -")
dataframe_spk.show()
Output
The original data frame is: -
+---------------+-----+------+----------+
|      Dish name|Price|Rating|discount %|
+---------------+-----+------+----------+
|        Biryani|  250|     9|        20|
|           Fish|  350|     7|        30|
|Mashed potatoes|  180|     8|        10|
|          Salad|  200|     6|        15|
+---------------+-----+------+----------+
Now that we have created a Spark data frame, we will apply conditions to its columns to drop rows from it.
Applying Condition on a Single Column
We will start with the simple task of targeting a single column. Let's build the code here,
- After creating the data frame, we used the "filter()" function to remove the rows where the "Rating" column value is 8 or more; only rows with a rating below 8 pass the filter.
- The 2nd and the 4th rows were retained.
- We assigned the result to a new data frame, since "filter()" does not modify the original data frame.
Example
from pyspark.sql import SparkSession
import pandas as pd

sparkOBJ = SparkSession.builder.appName("DscSpark").getOrCreate()
dataset = {"Dish name": ["Biryani", "Fish", "Mashed potatoes", "Salad"],
           "Price": [250, 350, 180, 200],
           "Rating": [9, 7, 8, 6],
           "discount %": [20, 30, 10, 15]}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = sparkOBJ.createDataFrame(dataframe_pd)
print("The original data frame is: -")
dataframe_spk.show()
dataframe_fil = dataframe_spk.filter(dataframe_spk.Rating < 8)
dataframe_fil.show()
Output
The original data frame is: -
+---------------+-----+------+----------+
|      Dish name|Price|Rating|discount %|
+---------------+-----+------+----------+
|        Biryani|  250|     9|        20|
|           Fish|  350|     7|        30|
|Mashed potatoes|  180|     8|        10|
|          Salad|  200|     6|        15|
+---------------+-----+------+----------+

+---------+-----+------+----------+
|Dish name|Price|Rating|discount %|
+---------+-----+------+----------+
|     Fish|  350|     7|        30|
|    Salad|  200|     6|        15|
+---------+-----+------+----------+
Applying Condition on Multiple Columns
In order to increase the specificity of the data frame and ease data analysis, we can apply certain conditions on multiple columns of the data frame. This approach increases the efficiency with which the data is processed by eliminating unnecessary rows from the data frame.
We will use the "&" operator to combine conditions on multiple columns because, in a Spark data frame, the expressions are evaluated element-wise across all the rows. Python's "and" keyword cannot do this, so we require the element-wise logical operator "&", with each condition wrapped in parentheses since "&" binds more tightly than the comparison operators.
Example
Let's look at the code for a better understanding.
- After creating the data frame, we used the "filter()" function to keep only the rows where the "Rating" column value is greater than 7 and the "Price" column value is less than 300; all other rows are dropped.
- The rows that satisfy both conditions are retained, i.e., "Row 1" and "Row 3".
from pyspark.sql import SparkSession
import pandas as pd

sparkOBJ = SparkSession.builder.appName("DscSpark").getOrCreate()
dataset = {"Dish name": ["Biryani", "Fish", "Mashed potatoes", "Salad"],
           "Price": [250, 350, 180, 200],
           "Rating": [9, 7, 8, 6],
           "discount%": [20, 30, 10, 15]}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = sparkOBJ.createDataFrame(dataframe_pd)
print("The original data frame is: -")
dataframe_spk.show()
dataframe_fil = dataframe_spk.filter((dataframe_spk.Rating > 7) &
                                     (dataframe_spk.Price < 300))
dataframe_fil.show()
Output
The original data frame is: -
+---------------+-----+------+---------+
|      Dish name|Price|Rating|discount%|
+---------------+-----+------+---------+
|        Biryani|  250|     9|       20|
|           Fish|  350|     7|       30|
|Mashed potatoes|  180|     8|       10|
|          Salad|  200|     6|       15|
+---------------+-----+------+---------+

+---------------+-----+------+---------+
|      Dish name|Price|Rating|discount%|
+---------------+-----+------+---------+
|        Biryani|  250|     9|       20|
|Mashed potatoes|  180|     8|       10|
+---------------+-----+------+---------+
Conclusion
In this article, we discussed different methods to drop rows from a PySpark data frame by applying conditions to its columns. We created a data frame and first targeted a single column. After this, we applied conditions to multiple columns and dropped the rows that did not satisfy them.