Cleaning Data with dropna() in PySpark
Data cleaning is a crucial step in any data analysis or data science project to ensure accuracy and reliability. PySpark's dropna() function provides powerful capabilities for removing rows containing missing or null values from DataFrames, making it essential for big data processing.
The dropna() function allows you to specify conditions for removing rows based on missing values, with flexible parameters for different cleaning strategies.
Syntax
df.dropna(how="any", thresh=None, subset=None)
Parameters
how − Determines when to drop rows. Use "any" to drop rows containing at least one null value, or "all" to drop only rows where all values are null. Defaults to "any".
thresh − Minimum number of non-null values required for a row to be retained. When specified, thresh overrides how.
subset − Column name or list of column names to consider when checking for null values; all other columns are ignored.
Creating Sample DataFrame
First, let's create a PySpark DataFrame with missing values to demonstrate the cleaning process:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("DataCleaning") \
    .getOrCreate()

# Sample data with null values
data = [(1, "John", "Data Scientist", "Seattle"),
        (2, None, "Software Developer", None),
        (3, "Emma", "Data Analyst", "New York"),
        (4, None, None, "San Francisco"),
        (5, "Andrew", "Android Developer", "Los Angeles"),
        (6, "Sarah", None, None),
        (None, None, None, None)]
schema = ["Id", "Name", "Job_Profile", "City"]
df = spark.createDataFrame(data, schema)
df.show()
+----+------+------------------+-------------+
|  Id|  Name|       Job_Profile|         City|
+----+------+------------------+-------------+
|   1|  John|    Data Scientist|      Seattle|
|   2|  null|Software Developer|         null|
|   3|  Emma|      Data Analyst|     New York|
|   4|  null|              null|San Francisco|
|   5|Andrew| Android Developer|  Los Angeles|
|   6| Sarah|              null|         null|
|null|  null|              null|         null|
+----+------+------------------+-------------+
Using how="any" Parameter
The how="any" parameter drops any row containing at least one null value:
# Drop rows with any null values
clean_df = df.dropna(how="any")
clean_df.show()
+---+------+-----------------+-----------+
| Id|  Name|      Job_Profile|       City|
+---+------+-----------------+-----------+
|  1|  John|   Data Scientist|    Seattle|
|  3|  Emma|     Data Analyst|   New York|
|  5|Andrew|Android Developer|Los Angeles|
+---+------+-----------------+-----------+
Using how="all" Parameter
The how="all" parameter only drops rows where all values are null:
# Drop rows where all values are null
clean_df = df.dropna(how="all")
clean_df.show()
+---+------+------------------+-------------+
| Id|  Name|       Job_Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+
Using thresh Parameter
The thresh parameter specifies the minimum number of non-null values required to keep a row:
# Keep rows with at least 2 non-null values
clean_df = df.dropna(thresh=2)
clean_df.show()
+---+------+------------------+-------------+
| Id|  Name|       Job_Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+
Using subset Parameter
The subset parameter allows you to specify which columns to check for null values:
# Drop rows where 'City' column has null values
clean_df = df.dropna(subset=["City"])
clean_df.show()
+---+------+-----------------+-------------+
| Id|  Name|      Job_Profile|         City|
+---+------+-----------------+-------------+
|  1|  John|   Data Scientist|      Seattle|
|  3|  Emma|     Data Analyst|     New York|
|  4|  null|             null|San Francisco|
|  5|Andrew|Android Developer|  Los Angeles|
+---+------+-----------------+-------------+
Comparison of Methods
| Method | Behavior | Use Case |
|---|---|---|
| how="any" | Drop rows with any null values | Strict data quality requirements |
| how="all" | Drop rows where all values are null | Keep partially complete records |
| thresh=n | Keep rows with at least n non-null values | Flexible threshold-based cleaning |
| subset=[cols] | Check specific columns only | Clean based on critical columns |
Conclusion
PySpark's dropna() function provides flexible options for cleaning DataFrames by removing rows with missing values. Choose the appropriate parameters based on your data quality requirements and analysis needs.
