Drop duplicate rows in PySpark DataFrame


PySpark is the Python API for Apache Spark, a framework designed by the Apache Spark community to process data at scale, including in real time, and analyse the results in a local Python environment. Spark data frames differ from other data frames in that they distribute the data across a cluster and follow a schema.

Spark can handle stream processing as well as batch processing, which is a major reason for its popularity. A PySpark data frame requires a SparkSession as its entry point, and it performs in-memory (RAM) processing of the data. You can install the PySpark module using the following command –

pip install pyspark
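Once PySpark is installed, a SparkSession serves as the entry point mentioned above. Here is a minimal sketch of creating one (the application name is an arbitrary label chosen for illustration):

from pyspark.sql import SparkSession

# getOrCreate() returns the active session if one exists, otherwise starts a new one.
spark = SparkSession.builder.appName("DropDuplicatesDemo").getOrCreate()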

In this article, we will create a PySpark data frame and discuss the different methods for dropping duplicate rows from it. First, let’s understand the concept of a PySpark data frame.

Creating and Understanding the PySpark Data Frame

Just like any other data frame, a PySpark data frame stores data in a tabular manner. It allows the programmer to work on structured as well as semi-structured data and provides high-level APIs (Python, Java, Scala, R) for processing complex datasets. Because it can analyse data very quickly, it is very helpful in stream processing as well as batch processing.

Now that we have discussed the basics of the PySpark data frame, let’s use Python code to create one. We will create a PySpark data frame consisting of information related to different car racers.


  • We imported the necessary libraries, including “pandas” and “pyspark”. We also imported a unified interface named “SparkSession”.

  • This interface acts as an “entry point” for accessing the various Spark APIs used in data processing. In short, we create a SparkSession to set up the required configuration.

  • After this, we created an instance of this SparkSession using the “builder” API. We also used the “getOrCreate()” method, which returns an existing session if one is active or creates a new one otherwise.

  • After completing the configuration part, we prepared a dataset dictionary consisting of different car features. We used this dataset to generate a pandas data frame.

  • The generated 5×3 data frame is stored in the “dataframe_pd” variable. This variable is passed as the argument to the “createDataFrame()” method of “SparkSession” to create a PySpark data frame.

  • We used a pandas data frame to generate the PySpark data frame, but this is not a mandatory step. We can directly use a list of tuples as the dataset and pass it to the “createDataFrame()” method, as shown in the sketch after the output below.

  • Finally, we displayed the data frame using the “dataframe_spk.show()” method.

Example

from pyspark.sql import SparkSession
import pandas as pd

# Create (or reuse) the SparkSession that acts as the entry point.
spark = SparkSession.builder.getOrCreate()

# Dataset with deliberately duplicated rows ("Racer1" appears three times).
dataset = {
   "Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
   "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
   "Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}
dataframe_pd = pd.DataFrame(dataset, index=["Racer1", "Racer2", "Racer3", "Racer1", "Racer1"])

# Convert to a PySpark data frame; the pandas index is dropped here.
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("The original data frame is like: -")
dataframe_spk.show()

Output

The original data frame is like: -
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+
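As mentioned in the steps above, the pandas detour is optional. The following minimal sketch (reusing the same spark session) builds an equivalent data frame directly from a list of tuples:

# Build the same data frame from a list of tuples; the second argument
# to createDataFrame() supplies the column names.
rows = [
   ("Audi", "300 KPH", "MS321"),
   ("Mercedes", "250 KPH", "QR345"),
   ("BMW", "220 KPH", "WX281"),
   ("Audi", "300 KPH", "MS321"),
   ("Audi", "300 KPH", "MS321")
]
dataframe_spk = spark.createDataFrame(rows, ["Carname", "Max Speed", "Car number"])
dataframe_spk.show()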

Now that we have created a PySpark data frame, let’s discuss the different methods to drop duplicate rows from it.

Using the distinct() Function to Drop Rows

This function returns a new data frame containing only the distinct (unique) rows. It removes every duplicate row from the data frame, keeping a single copy of each.

Example

The distinct() function takes no arguments. Let’s see its implementation.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

dataset = {"Carname":["Audi", "Mercedes", "BMW", "Audi", "Audi"], "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"], "Car number":["MS321", "QR345", "WX281", "MS321", "MS321"]}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("The original data frame is like: -")
dataframe_spk.show()
print("After dropping the duplicate rows we get: -")
# distinct() returns a new data frame with the duplicate rows removed.
dataframe_spk.distinct().show()

Output

The original data frame is like: -
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+

After dropping the duplicate rows we get: -
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+

After creating the PySpark data frame, we used the distinct() function to keep only the unique rows and eliminate the duplicates from the data frame.

Using the dropDuplicates() Function

This is an alternative method that, when called without arguments, works the same way as the distinct() function. Unlike distinct(), it can also target specific columns and drop rows based on duplicates in those columns alone, as we will see below. Let’s build the code.

Example

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

dataset = {"Carname":["Audi", "Mercedes", "BMW", "Audi", "Audi"], "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"], "Car number":["MS321", "QR345", "WX281", "MS321", "MS321"]}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("The original data frame is like: -")
dataframe_spk.show()

print("After dropping the duplicate rows we get: -")
# Without arguments, dropDuplicates() removes fully duplicated rows, just like distinct().
dataframe_spk.dropDuplicates().show()

Output

The original data frame is like: -
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+
After dropping the duplicate rows we get: -
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|    Audi|  300 KPH|     MS321|
|     BMW|  220 KPH|     WX281|
+--------+---------+----------+

Targeting a Specific Column

We can check a specific column for duplicate values with the help of the “select()” function and then apply the dropDuplicates() function to the selected column.

Example

dataframe_spk.select(["Carname"]).dropDuplicates().show()

Output

+--------+
| Carname|
+--------+
|Mercedes|
|     BMW|
|    Audi|
+--------+ 
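Note that select() keeps only the chosen column, so the rest of each row is lost. If you want to de-duplicate on one column while keeping all the columns, dropDuplicates() also accepts a list of column names. A short sketch using the same data frame:

# Rows count as duplicates whenever their "Carname" values match; all columns
# are kept, but which of the duplicate rows survives is arbitrary.
dataframe_spk.dropDuplicates(["Carname"]).show()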

Conclusion

This article explained the basic operation of dropping rows that contain duplicate or repetitive data. We discussed the different functions involved, including “dropDuplicates()”, “distinct()” and “select()”. We created a reference data frame and dropped the duplicate rows from it.

Updated on: 05-May-2023
