Drop Rows Containing a Specific Value in a PySpark Data Frame


When we deal with complex datasets, we need frameworks that can process data quickly and return results. This is where PySpark comes into the picture. PySpark is the Python API for Apache Spark, a framework developed by the Apache community for large-scale and real-time data processing. It lets us create data frames and interpret the results in our local Python environment.

A data frame can contain a huge amount of data, and to keep that data relevant for interpretation we make the required changes to it. In this article, we will manipulate a PySpark data frame and drop rows containing specific values. This selective elimination of data is very simple, yet it can prove to be of immense importance.

Preparing a Dataset to Create a PySpark Data Frame

Before we start dropping the rows, we should prepare an appropriate dataset. With the help of this dataset, we shall create a data frame.

Example

We imported “SparkSession” from “pyspark.sql”. The SparkSession acts as the entry point to Apache Spark and manages the configuration of the framework.

We created a “SparkSession” object named “sparkOBJ” with the help of the builder method. Setting the app name is not mandatory, but it helps the programmer locate a particular session.

We prepared a dataset of 6 rows, each holding 4 numeric values stored as strings, along with a list of column labels. We then used the “SparkSession” instance and its “createDataFrame” method to create a PySpark data frame.

The dataset and the list of column names were passed as arguments to this method, and finally we printed the data frame.

# SparkSession is the entry point to Spark functionality
from pyspark.sql import SparkSession

# Create (or reuse) a session; the app name just labels it
sparkOBJ = SparkSession.builder.appName("SPARK_DSC").getOrCreate()

# 6 rows of string values under 4 column labels
dataset = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
           ("15", "18", "45", "11"), ("12", "8", "18", "28"),
           ("25", "42", "18", "29"), ("45", "50", "17", "11")]
columnsLis = ["Sequence 1", "Sequence 2", "Sequence 3", "Sequence 4"]

# Build the data frame from the dataset and the column labels
dataframe_SPK = sparkOBJ.createDataFrame(dataset, columnsLis)
print("the original data frame is: -")
dataframe_SPK.show()

Output

the original data frame is: -
+----------+----------+----------+----------+
|Sequence 1|Sequence 2|Sequence 3|Sequence 4|
+----------+----------+----------+----------+
|        23|        25|        18|        19|
|        18|        25|        28|        40|
|        15|        18|        45|        11|
|        12|         8|        18|        28|
|        25|        42|        18|        29|
|        45|        50|        17|        11|
+----------+----------+----------+----------+

Now that we have successfully prepared a data frame, let’s remove certain rows containing a specific value. We will begin by targeting a single column.

Using the filter() Function to Drop Rows with a Specific Value

This function filters the data based on the condition passed to it. The condition drives the logic of the function and specifies what to keep and what to filter out. We will use this function to target a specific value under a column.

Example

In the following example, we use the filter() function to drop the rows containing the value “18”. It only targets the values under the “Sequence 1” column.

The new data frame consists of only those rows where “Sequence 1” is not equal to “18”; every row containing this value in that column has been dropped from the data frame.

from pyspark.sql import SparkSession
sparkOBJ = SparkSession.builder.appName("SPARK_DSC").getOrCreate()
dataset = [("23", "25", "18", "19"), ("18", "25", "28", "40"),
           ("15", "18", "45", "11"), ("12", "8", "18", "28"),
           ("25", "42", "18", "29"), ("45", "50", "17", "11")]
columnsLis = ["Sequence 1", "Sequence 2", "Sequence 3", "Sequence 4"]

dataframe_SPK = sparkOBJ.createDataFrame(dataset, columnsLis)
print("the original data frame is: -")
dataframe_SPK.show()
# Keep only the rows where "Sequence 1" is not "18"
dataframe_SPK.filter(dataframe_SPK["Sequence 1"] != "18").show()

Output

the original data frame is: -
+----------+----------+----------+----------+
|Sequence 1|Sequence 2|Sequence 3|Sequence 4|
+----------+----------+----------+----------+
|        23|        25|        18|        19|
|        18|        25|        28|        40|
|        15|        18|        45|        11|
|        12|         8|        18|        28|
|        25|        42|        18|        29|
|        45|        50|        17|        11|
+----------+----------+----------+----------+

+----------+----------+----------+----------+
|Sequence 1|Sequence 2|Sequence 3|Sequence 4|
+----------+----------+----------+----------+
|        23|        25|        18|        19|
|        15|        18|        45|        11|
|        12|         8|        18|        28|
|        25|        42|        18|        29|
|        45|        50|        17|        11|
+----------+----------+----------+----------+
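As a side note, the same filter() function can drop rows whose value matches any entry in a list, not just one value. The following is a minimal sketch of this variant, assuming the “dataframe_SPK” created above; the values “18” and “45” are chosen purely for illustration.

from pyspark.sql.functions import col

# Drop every row whose "Sequence 1" value is either "18" or "45"
dataframe_SPK.filter(~col("Sequence 1").isin("18", "45")).show()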

Using the where() Function to Drop Rows

We can perform a similar operation with the help of the “where()” function. This function also works around a particular condition: the rows for which the condition evaluates to “true” are kept in the new data frame. Let’s look at its implementation.

Example

dataframe_SPK.where(dataframe_SPK["Sequence 1"] != "18").show()

Output

+----------+----------+----------+----------+
|Sequence 1|Sequence 2|Sequence 3|Sequence 4|
+----------+----------+----------+----------+
|        23|        25|        18|        19|
|        15|        18|        45|        11|
|        12|         8|        18|        28|
|        25|        42|        18|        29|
|        45|        50|        17|        11|
+----------+----------+----------+----------+

Here, the “where()” function behaves exactly the same way as the “filter()” function; in PySpark, where() is in fact an alias of filter(), so even the syntax is identical. These two methods can easily weave conditions and drop rows based on them. The conditions can also span multiple columns so that more data can be filtered.

Example

dataframe_SPK.filter((dataframe_SPK["Sequence 1"] != "18") &
                     (dataframe_SPK["Sequence 2"] != "18") &
                     (dataframe_SPK["Sequence 3"] != "18") &
                     (dataframe_SPK["Sequence 4"] != "18")).show()

However, it is important to first analyze the data frame and only then decide which information to drop.
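One simple way to perform that analysis is to count how many rows a condition would remove before actually dropping them. A minimal sketch, assuming the same “dataframe_SPK” as above:

# Count the rows that would be dropped by the condition
to_drop = dataframe_SPK.filter(dataframe_SPK["Sequence 1"] == "18").count()
print(to_drop, "row(s) contain '18' in 'Sequence 1'")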

Conclusion

This article focused on the fundamentals of the PySpark data frame. We discussed different methods to drop rows containing specific values: we used the filter() and where() functions and established conditions to drop the rows. Eliminating data from a data frame is a crucial step, and it should be done carefully.
