Cleaning Data with dropna() in PySpark
Data cleaning is a crucial step in any data analysis or data science project to ensure accuracy and reliability. PySpark's dropna() function provides powerful capabilities for removing rows containing missing or null values from DataFrames, making it essential for big data processing.
The dropna() function allows you to specify conditions for removing rows based on missing values, with flexible parameters for different cleaning strategies.
Syntax
df.dropna(how="any", thresh=None, subset=None)
Parameters
how − Determines when to drop rows. Use "any" to drop rows containing at least one null value, or "all" to drop only rows where all values are null. Defaults to "any".
thresh − Minimum number of non-null values required for a row to be retained. When specified, thresh overrides how.
subset − Column name or list of column names to consider when checking for null values; all other columns are ignored.
Creating Sample DataFrame
First, let's create a PySpark DataFrame with missing values to demonstrate the cleaning process:
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("DataCleaning") \
    .getOrCreate()

# Sample data with null values
data = [(1, "John", "Data Scientist", "Seattle"),
        (2, None, "Software Developer", None),
        (3, "Emma", "Data Analyst", "New York"),
        (4, None, None, "San Francisco"),
        (5, "Andrew", "Android Developer", "Los Angeles"),
        (6, "Sarah", None, None),
        (None, None, None, None)]
schema = ["Id", "Name", "Job_Profile", "City"]
df = spark.createDataFrame(data, schema)
df.show()
+----+------+------------------+-------------+
|  Id|  Name|       Job_Profile|         City|
+----+------+------------------+-------------+
|   1|  John|    Data Scientist|      Seattle|
|   2|  null|Software Developer|         null|
|   3|  Emma|      Data Analyst|     New York|
|   4|  null|              null|San Francisco|
|   5|Andrew| Android Developer|  Los Angeles|
|   6| Sarah|              null|         null|
|null|  null|              null|         null|
+----+------+------------------+-------------+
Using how="any" Parameter
The how="any" parameter drops any row containing at least one null value:
# Drop rows with any null values
clean_df = df.dropna(how="any")
clean_df.show()
+---+------+-----------------+-----------+
| Id|  Name|      Job_Profile|       City|
+---+------+-----------------+-----------+
|  1|  John|   Data Scientist|    Seattle|
|  3|  Emma|     Data Analyst|   New York|
|  5|Andrew|Android Developer|Los Angeles|
+---+------+-----------------+-----------+
Using how="all" Parameter
The how="all" parameter only drops rows where all values are null:
# Drop rows where all values are null
clean_df = df.dropna(how="all")
clean_df.show()
+---+------+------------------+-------------+
| Id|  Name|       Job_Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+
Using thresh Parameter
The thresh parameter specifies the minimum number of non-null values required to keep a row:
# Keep rows with at least 2 non-null values
clean_df = df.dropna(thresh=2)
clean_df.show()
+---+------+------------------+-------------+
| Id|  Name|       Job_Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+
Using subset Parameter
The subset parameter allows you to specify which columns to check for null values:
# Drop rows where 'City' column has null values
clean_df = df.dropna(subset=["City"])
clean_df.show()
+---+------+-----------------+-------------+
| Id|  Name|      Job_Profile|         City|
+---+------+-----------------+-------------+
|  1|  John|   Data Scientist|      Seattle|
|  3|  Emma|     Data Analyst|     New York|
|  4|  null|             null|San Francisco|
|  5|Andrew|Android Developer|  Los Angeles|
+---+------+-----------------+-------------+
Comparison of Methods
| Method | Behavior | Use Case |
|---|---|---|
| how="any" | Drop rows with any null values | Strict data quality requirements |
| how="all" | Drop rows where all values are null | Keep partially complete records |
| thresh=n | Keep rows with at least n non-null values | Flexible threshold-based cleaning |
| subset=[cols] | Check specific columns only | Clean based on critical columns |
Conclusion
PySpark's dropna() function provides flexible options for cleaning DataFrames by removing rows with missing values. Choose the appropriate parameters based on your data quality requirements and analysis needs.
