Drop duplicate rows in PySpark DataFrame

PySpark is the Python API for Apache Spark, a framework for processing large-scale data with distributed computing. Unlike pandas DataFrames, PySpark DataFrames are distributed across the nodes of a cluster and follow a defined schema, which enables optimized processing.

In this article, we'll explore different methods to drop duplicate rows from PySpark DataFrames using distinct() and dropDuplicates() functions.

Installation

Install PySpark using pip:

pip install pyspark

Creating a PySpark DataFrame

First, let's create a sample DataFrame with duplicate rows to demonstrate the deduplication methods:

from pyspark.sql import SparkSession
import pandas as pd

# Create SparkSession
spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()

# Sample dataset with duplicates
dataset = {
    "Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
    "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
    "Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}

# Create pandas DataFrame first, then convert to PySpark
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)

print("Original DataFrame:")
dataframe_spk.show()
Output

Original DataFrame:
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+

Using distinct() Method

The distinct() method returns a new DataFrame containing only the unique rows, removing all exact duplicates:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()

dataset = {
    "Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
    "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
    "Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}

dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)

print("After removing duplicates with distinct():")
dataframe_spk.distinct().show()
Output

After removing duplicates with distinct():
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+
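Conceptually, distinct() compares entire rows: a row is dropped only when every column matches another row. The same idea can be sketched in plain Python (an illustrative analogy only, not Spark's actual implementation):

```python
# Each tuple is one row of the DataFrame above
rows = [
    ("Audi", "300 KPH", "MS321"),
    ("Mercedes", "250 KPH", "QR345"),
    ("BMW", "220 KPH", "WX281"),
    ("Audi", "300 KPH", "MS321"),
    ("Audi", "300 KPH", "MS321"),
]

# dict.fromkeys keeps one copy of each full row, preserving first occurrence
unique_rows = list(dict.fromkeys(rows))
print(len(unique_rows))  # 3
```

Note that in Spark itself the row order of the result is not guaranteed, since the comparison happens in parallel across partitions.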

Using dropDuplicates() Method

The dropDuplicates() method behaves like distinct() when called without arguments, and additionally accepts a list of columns to restrict the duplicate check:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()

dataset = {
    "Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
    "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
    "Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}

dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)

print("After removing duplicates with dropDuplicates():")
dataframe_spk.dropDuplicates().show()
Output

After removing duplicates with dropDuplicates():
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|    Audi|  300 KPH|     MS321|
|     BMW|  220 KPH|     WX281|
+--------+---------+----------+

Dropping Duplicates from Specific Columns

You can specify particular columns to check for duplicates by passing column names to dropDuplicates():

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()

dataset = {
    "Carname": ["Audi", "Mercedes", "BMW", "Audi", "Tesla"],
    "Max Speed": ["300 KPH", "250 KPH", "220 KPH", "280 KPH", "400 KPH"],
    "Car number": ["MS321", "QR345", "WX281", "AB123", "TL567"]
}

dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)

print("Original DataFrame:")
dataframe_spk.show()

print("Remove duplicates based on 'Carname' column only:")
dataframe_spk.dropDuplicates(['Carname']).show()
Output

Original DataFrame:
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  280 KPH|     AB123|
|   Tesla|  400 KPH|     TL567|
+--------+---------+----------+

Remove duplicates based on 'Carname' column only:
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|   Tesla|  400 KPH|     TL567|
+--------+---------+----------+
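dropDuplicates(['Carname']) keeps exactly one row per distinct Carname; which row survives is not guaranteed, since Spark processes partitions in parallel. The column-subset semantics can be sketched in plain Python (a conceptual illustration only; drop_duplicates_by_key is a hypothetical helper, not part of PySpark):

```python
def drop_duplicates_by_key(rows, key):
    """Keep the first row seen for each distinct value of `key`."""
    seen = set()
    result = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            result.append(row)
    return result

rows = [
    {"Carname": "Audi", "Max Speed": "300 KPH", "Car number": "MS321"},
    {"Carname": "Mercedes", "Max Speed": "250 KPH", "Car number": "QR345"},
    {"Carname": "Audi", "Max Speed": "280 KPH", "Car number": "AB123"},
]

deduped = drop_duplicates_by_key(rows, "Carname")
print([r["Car number"] for r in deduped])  # ['MS321', 'QR345']
```

Unlike this sketch, Spark makes no promise about keeping the first occurrence; if you need a deterministic survivor, sort or rank the rows before deduplicating.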

Comparison

Method | Parameters | Use Case
distinct() | None | Removes rows that are duplicated across the entire row
dropDuplicates() | Optional column list | More flexible; can target specific columns

Conclusion

Use distinct() for simple deduplication of entire rows. Use dropDuplicates() when you need more control or want to check specific columns for duplicates.

Updated on: 2026-03-27T06:12:57+05:30
