Drop rows in PySpark DataFrame with condition
Applying conditions to a data frame is very useful for a programmer. We can validate data to make sure it fits our model, and we can filter out irrelevant rows, which simplifies analysis and improves data visualization. In this article, we will apply conditions to a PySpark data frame and drop rows from it. PySpark is the Python API for Apache Spark; it allows a programmer to use Spark's real-time, distributed data processing from a local Python environment.
Example
Now that we have acquired a basic understanding of what PySpark data frames are, let's create one. Here,
- We imported "SparkSession" from "pyspark.sql". This session acts as an entry point to the Spark APIs and allows us to configure the data frame as we wish. We create only one SparkSession per application and use it throughout the codebase.
- We built a dictionary dataset that contains information about dishes, including their names, prices, ratings and discounts, and used it to create a pandas data frame.
- The pandas data frame is then passed as an argument to the "createDataFrame()" method to create a Spark data frame.
- At last, we displayed the data frame using the "show()" method.
from pyspark.sql import SparkSession
import pandas as pd

sparkOBJ = SparkSession.builder.appName("DscSpark").getOrCreate()
dataset = {"Dish name": ["Biryani", "Fish", "Mashed potatoes", "Salad"],
           "Price": [250, 350, 180, 200],
           "Rating": [9, 7, 8, 6],
           "discount %": [20, 30, 10, 15]}
dataframe_pd = pd.DataFrame(dataset, index=[1, 2, 3, 4])
dataframe_spk = sparkOBJ.createDataFrame(dataframe_pd)
print("The original data frame is: -")
dataframe_spk.show()
Output
The original data frame is: -
+---------------+-----+------+----------+
|      Dish name|Price|Rating|discount %|
+---------------+-----+------+----------+
|        Biryani|  250|     9|        20|
|           Fish|  350|     7|        30|
|Mashed potatoes|  180|     8|        10|
|          Salad|  200|     6|        15|
+---------------+-----+------+----------+
Now that we have created a Spark data frame, we will apply conditions to its columns to drop rows from it.
Applying Condition on a Single Column
We will start with the simple task of targeting a single column. Let's build the code here,
- After creating the data frame, we used the "filter()" function to remove the rows where the "Rating" column value is 8 or more; only rows with a rating below 8 pass the filter.
- The 2nd and the 4th rows were retained.
- We assigned the result to a new data frame, since "filter()" does not modify the original data frame.
Example
from pyspark.sql import SparkSession
import pandas as pd

sparkOBJ = SparkSession.builder.appName("DscSpark").getOrCreate()
dataset = {"Dish name": ["Biryani", "Fish", "Mashed potatoes", "Salad"],
           "Price": [250, 350, 180, 200],
           "Rating": [9, 7, 8, 6],
           "discount %": [20, 30, 10, 15]}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = sparkOBJ.createDataFrame(dataframe_pd)
print("The original data frame is: -")
dataframe_spk.show()
dataframe_fil = dataframe_spk.filter(dataframe_spk.Rating < 8)
dataframe_fil.show()
Output
The original data frame is: -
+---------------+-----+------+----------+
|      Dish name|Price|Rating|discount %|
+---------------+-----+------+----------+
|        Biryani|  250|     9|        20|
|           Fish|  350|     7|        30|
|Mashed potatoes|  180|     8|        10|
|          Salad|  200|     6|        15|
+---------------+-----+------+----------+

+---------+-----+------+----------+
|Dish name|Price|Rating|discount %|
+---------+-----+------+----------+
|     Fish|  350|     7|        30|
|    Salad|  200|     6|        15|
+---------+-----+------+----------+
Applying Condition on Multiple Columns
In order to increase the specificity of the data frame and ease data analysis, we can apply certain conditions on multiple columns of the data frame. This approach increases the efficiency with which the data is processed by eliminating unnecessary rows from the data frame.
We will use the "&" operator to combine conditions on multiple columns because, in a Spark data frame, the expressions are evaluated element-wise across all the rows. Python's "and" keyword cannot do this, so we require the element-wise logical operator "&", with each condition wrapped in parentheses since "&" binds more tightly than the comparison operators.
Example
Let's look at the code for a better understanding.
- After creating the data frame, we used the "filter()" function to keep only the rows where the "Rating" column value is greater than 7 and the "Price" column value is less than 300; all other rows are dropped.
- The rows that satisfy both conditions are retained, i.e., "Row 1" and "Row 3".
from pyspark.sql import SparkSession
import pandas as pd

sparkOBJ = SparkSession.builder.appName("DscSpark").getOrCreate()
dataset = {"Dish name": ["Biryani", "Fish", "Mashed potatoes", "Salad"],
           "Price": [250, 350, 180, 200],
           "Rating": [9, 7, 8, 6],
           "discount%": [20, 30, 10, 15]}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = sparkOBJ.createDataFrame(dataframe_pd)
print("The original data frame is: -")
dataframe_spk.show()
dataframe_fil = dataframe_spk.filter((dataframe_spk.Rating > 7) &
                                     (dataframe_spk.Price < 300))
dataframe_fil.show()
Output
The original data frame is: -
+---------------+-----+------+---------+
|      Dish name|Price|Rating|discount%|
+---------------+-----+------+---------+
|        Biryani|  250|     9|       20|
|           Fish|  350|     7|       30|
|Mashed potatoes|  180|     8|       10|
|          Salad|  200|     6|       15|
+---------------+-----+------+---------+

+---------------+-----+------+---------+
|      Dish name|Price|Rating|discount%|
+---------------+-----+------+---------+
|        Biryani|  250|     9|       20|
|Mashed potatoes|  180|     8|       10|
+---------------+-----+------+---------+
Conclusion
In this article, we discussed different methods to drop rows from a PySpark data frame by applying conditions to its columns. We created a data frame and first targeted a single column. After this, we applied conditions to multiple columns and dropped the rows that did not satisfy them.