Cleaning Data with Apache Spark in Python

Apache Spark is an open-source big data processing framework that enables parallel and distributed processing of large datasets. Data cleaning is a crucial step in data analysis, and Spark provides powerful tools for handling missing values, duplicates, outliers, and data type conversions efficiently.

Installation

Before working with Apache Spark in Python, install the PySpark library:

pip install pyspark

Handling Missing Values

Missing values are common in real-world datasets. Apache Spark provides several strategies to handle them:

  • Dropping rows: Remove records containing missing values

  • Filling missing values: Replace them with specific values such as the mean, median, or a constant

  • Interpolation: Use mathematical methods to estimate missing values

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

spark = SparkSession.builder.appName("MissingData").getOrCreate()

# Create sample DataFrame with missing values
data = [("John", 25, None), ("Jane", 30, 35.5), ("Jim", None, 40.0), ("Joan", 32, None)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

# Calculate mean values for imputation
mean_age = df.agg(mean(df["Age"])).first()[0]
mean_salary = df.agg(mean(df["Salary"])).first()[0]

# Fill missing values with mean
df_cleaned = df.fillna({"Age": mean_age, "Salary": mean_salary})

print("Cleaned DataFrame:")
df_cleaned.show()

spark.stop()

Output

Original DataFrame:
+----+----+------+
|Name| Age|Salary|
+----+----+------+
|John|  25|  null|
|Jane|  30|  35.5|
| Jim|null|  40.0|
|Joan|  32|  null|
+----+----+------+

Cleaned DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25| 37.75|
|Jane| 30|  35.5|
| Jim| 29|  40.0|
|Joan| 32| 37.75|
+----+---+------+

Handling Duplicates

Duplicate records can skew analysis results. Spark provides the dropDuplicates() method to remove identical rows efficiently.

Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DuplicateData").getOrCreate()

# Create DataFrame with duplicate records
data = [("John", 25, 90.0), ("Jane", 30, 35.5), ("Jim", 20, 200.0), 
        ("Joan", 32, 50.0), ("John", 25, 90.0), ("Jim", 20, 200.0)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

# Remove duplicates
df_cleaned = df.dropDuplicates()

print("Cleaned DataFrame:")
df_cleaned.show()

spark.stop()

Output

Original DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
|John| 25|  90.0|
| Jim| 20| 200.0|
+----+---+------+

Cleaned DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|Jane| 30|  35.5|
|John| 25|  90.0|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+

Handling Outliers

Outliers are extreme values that can significantly impact statistical analysis. Common approaches include filtering based on standard deviation or using percentile-based methods.

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, abs

spark = SparkSession.builder.appName("OutlierHandling").getOrCreate()

# Create DataFrame with outliers
data = [("John", 25, 90.0), ("Jane", 30, 35.5), ("Jim", 20, 200.0), ("Joan", 32, 50.0)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()

# Calculate mean and standard deviation
mean_salary = df.agg(mean(df["Salary"])).first()[0]
stddev_salary = df.agg(stddev(df["Salary"])).first()[0]

# Filter outliers (values beyond 1 standard deviation)
df_cleaned = df.filter(abs(df["Salary"] - mean_salary) < stddev_salary)

print("Cleaned DataFrame:")
df_cleaned.show()

spark.stop()

Output

Original DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+

Cleaned DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
|Joan| 32|  50.0|
+----+---+------+

Converting Data Types

Data type conversion ensures columns have appropriate formats for analysis. Use the cast() method to convert between string, integer, float, and other data types.

Example

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType

spark = SparkSession.builder.appName("DataTypeConversion").getOrCreate()

# Create DataFrame with string columns
data = [("John", "25", "90"), ("Jane", "30", "35"), ("Jim", "20", "200"), ("Joan", "32", "50")]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()
df.printSchema()

# Convert Age to integer and Salary to float
df_converted = df.withColumn("Age", df["Age"].cast(IntegerType())) \
                 .withColumn("Salary", df["Salary"].cast(FloatType()))

print("Converted DataFrame:")
df_converted.show()
df_converted.printSchema()

spark.stop()

Output

Original DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|    90|
|Jane| 30|    35|
| Jim| 20|   200|
|Joan| 32|    50|
+----+---+------+
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Salary: string (nullable = true)

Converted DataFrame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.0|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Salary: float (nullable = true)

Conclusion

Apache Spark provides comprehensive tools for data cleaning including handling missing values, duplicates, outliers, and data type conversions. These capabilities, combined with Spark's distributed processing power, make it an excellent choice for preparing large datasets for analysis and machine learning applications.

Updated on: 2026-04-02T17:10:53+05:30
