Cleaning Data with dropna() in PySpark

Data cleaning is a crucial step in any data analysis or data science project to ensure accuracy and reliability. PySpark's dropna() function provides powerful capabilities for removing rows containing missing or null values from DataFrames, making it essential for big data processing.

The dropna() function allows you to specify conditions for removing rows based on missing values, with flexible parameters for different cleaning strategies.

Syntax

df.dropna(how="any", thresh=None, subset=None)

Parameters

  • how: Determines when to drop a row. "any" (the default) drops rows containing at least one null value; "all" drops only rows where every value is null.

  • thresh: Minimum number of non-null values required for a row to be retained. When specified, thresh overrides the how parameter.

  • subset: Column name or list of column names to consider when checking for null values; other columns are ignored.

Creating Sample DataFrame

First, let's create a PySpark DataFrame with missing values to demonstrate the cleaning process:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("DataCleaning") \
    .getOrCreate()

# Sample data with null values
data = [(1, "John", "Data Scientist", "Seattle"),
        (2, None, "Software Developer", None),
        (3, "Emma", "Data Analyst", "New York"),
        (4, None, None, "San Francisco"),
        (5, "Andrew", "Android Developer", "Los Angeles"),
        (6, "Sarah", None, None),
        (None, None, None, None)]

schema = ["Id", "Name", "Job_Profile", "City"]
df = spark.createDataFrame(data, schema)
df.show()
+----+------+------------------+-------------+
|  Id|  Name|       Job_Profile|         City|
+----+------+------------------+-------------+
|   1|  John|    Data Scientist|      Seattle|
|   2|  null|Software Developer|         null|
|   3|  Emma|      Data Analyst|     New York|
|   4|  null|              null|San Francisco|
|   5|Andrew| Android Developer|  Los Angeles|
|   6| Sarah|              null|         null|
|null|  null|              null|         null|
+----+------+------------------+-------------+

Using how="any" Parameter

The how="any" parameter drops any row containing at least one null value:

# Drop rows with any null values
clean_df = df.dropna(how="any")
clean_df.show()
+---+------+-----------------+-----------+
| Id|  Name|      Job_Profile|       City|
+---+------+-----------------+-----------+
|  1|  John|   Data Scientist|    Seattle|
|  3|  Emma|     Data Analyst|   New York|
|  5|Andrew|Android Developer|Los Angeles|
+---+------+-----------------+-----------+

Using how="all" Parameter

The how="all" parameter only drops rows where all values are null:

# Drop rows where all values are null
clean_df = df.dropna(how="all")
clean_df.show()
+---+------+------------------+-------------+
| Id|  Name|       Job_Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+

Using thresh Parameter

The thresh parameter specifies the minimum number of non-null values required to keep a row:

# Keep rows with at least 2 non-null values
clean_df = df.dropna(thresh=2)
clean_df.show()
+---+------+------------------+-------------+
| Id|  Name|       Job_Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+

Using subset Parameter

The subset parameter allows you to specify which columns to check for null values:

# Drop rows where 'City' column has null values
clean_df = df.dropna(subset=["City"])
clean_df.show()
+---+------+-----------------+-------------+
| Id|  Name|      Job_Profile|         City|
+---+------+-----------------+-------------+
|  1|  John|   Data Scientist|      Seattle|
|  3|  Emma|     Data Analyst|     New York|
|  4|  null|             null|San Francisco|
|  5|Andrew|Android Developer|  Los Angeles|
+---+------+-----------------+-------------+

Comparison of Methods

  • how="any": Drops rows with any null values. Suited to strict data quality requirements.

  • how="all": Drops rows where all values are null. Keeps partially complete records.

  • thresh=n: Keeps rows with at least n non-null values. Enables flexible threshold-based cleaning.

  • subset=[cols]: Checks only the specified columns. Cleans based on critical columns.

Conclusion

PySpark's dropna() function provides flexible options for cleaning DataFrames by removing rows with missing values. Choose the appropriate parameters based on your data quality requirements and analysis needs.

Updated on: 2026-03-27T11:02:42+05:30
