Drop duplicate rows in PySpark DataFrame
PySpark is the Python API for Apache Spark, a distributed computing engine built to process large-scale data. Unlike pandas DataFrames, PySpark DataFrames are partitioned across a cluster and carry an explicit schema, which Spark uses to optimize processing.
In this article, we'll explore different methods to drop duplicate rows from PySpark DataFrames using distinct() and dropDuplicates() functions.
Installation
Install PySpark using pip:
pip install pyspark
Creating a PySpark DataFrame
First, let's create a sample DataFrame with duplicate rows to demonstrate the deduplication methods:
from pyspark.sql import SparkSession
import pandas as pd
# Create SparkSession
spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()
# Sample dataset with duplicates
dataset = {
"Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
"Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
"Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}
# Create pandas DataFrame first, then convert to PySpark
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("Original DataFrame:")
dataframe_spk.show()
Original DataFrame:
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+
Using distinct() Method
The distinct() method returns a new DataFrame with unique rows, removing all duplicates:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()
dataset = {
"Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
"Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
"Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("After removing duplicates with distinct():")
dataframe_spk.distinct().show()
After removing duplicates with distinct():
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
+--------+---------+----------+
Using dropDuplicates() Method
The dropDuplicates() method behaves exactly like distinct() when called without arguments, but it also accepts an optional list of columns for finer control:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()
dataset = {
"Carname": ["Audi", "Mercedes", "BMW", "Audi", "Audi"],
"Max Speed": ["300 KPH", "250 KPH", "220 KPH", "300 KPH", "300 KPH"],
"Car number": ["MS321", "QR345", "WX281", "MS321", "MS321"]
}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("After removing duplicates with dropDuplicates():")
dataframe_spk.dropDuplicates().show()
After removing duplicates with dropDuplicates():
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|    Audi|  300 KPH|     MS321|
|     BMW|  220 KPH|     WX281|
+--------+---------+----------+
Dropping Duplicates from Specific Columns
You can specify particular columns to check for duplicates by passing column names to dropDuplicates():
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("DropDuplicates").getOrCreate()
dataset = {
"Carname": ["Audi", "Mercedes", "BMW", "Audi", "Tesla"],
"Max Speed": ["300 KPH", "250 KPH", "220 KPH", "280 KPH", "400 KPH"],
"Car number": ["MS321", "QR345", "WX281", "AB123", "TL567"]
}
dataframe_pd = pd.DataFrame(dataset)
dataframe_spk = spark.createDataFrame(dataframe_pd)
print("Original DataFrame:")
dataframe_spk.show()
print("Remove duplicates based on 'Carname' column only:")
dataframe_spk.dropDuplicates(['Carname']).show()
Original DataFrame:
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|    Audi|  300 KPH|     MS321|
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  280 KPH|     AB123|
|   Tesla|  400 KPH|     TL567|
+--------+---------+----------+

Remove duplicates based on 'Carname' column only:
+--------+---------+----------+
| Carname|Max Speed|Car number|
+--------+---------+----------+
|Mercedes|  250 KPH|     QR345|
|     BMW|  220 KPH|     WX281|
|    Audi|  300 KPH|     MS321|
|   Tesla|  400 KPH|     TL567|
+--------+---------+----------+
Comparison
| Method | Parameters | Use Case |
|---|---|---|
| distinct() | None | Remove duplicates across entire rows |
| dropDuplicates() | Optional column list | More flexible; can target specific columns |
Conclusion
Use distinct() for simple deduplication of entire rows. Use dropDuplicates() when you need more control or want to check specific columns for duplicates.
