Cleaning Data with dropna in PySpark


Cleaning data is a crucial step in any data analysis or data science project: it ensures that the data is accurate, trustworthy, and appropriate for the intended analysis. PySpark's built-in data cleaning functions, such as dropna, make it a powerful tool for working with large datasets.

The dropna function in PySpark allows you to remove rows from a DataFrame that contain missing or null values. Missing or null values can occur in a DataFrame for various reasons, such as incomplete data, data entry errors, or inconsistent data formats. Removing these rows can help ensure the quality of the data for downstream analysis.

dropna is a versatile function that lets you specify the conditions under which rows are removed. You can choose whether a row is dropped when any of its values are null or only when all of them are (the how parameter), set the minimum number of non-null values a row must contain in order to be retained (the thresh parameter), and restrict the null check to a subset of columns (the subset parameter).

Using dropna in PySpark can significantly improve the quality and reliability of your data. By removing rows with missing or null values, you can ensure that your analysis is based on complete and accurate data. With its flexibility and ease of use, dropna is an essential tool in any data cleaning toolkit for PySpark users.

In this article, we will discuss the process of cleaning a DataFrame and the use of the dropna() function to achieve this. The primary purpose of cleaning a DataFrame is to ensure that it contains accurate and reliable data that is suitable for analysis.

The syntax of the dropna() function is as follows:

df.dropna(how="any", thresh=None, subset=None)

Where df is the DataFrame being cleaned. The function takes three parameters:

  • how − This parameter specifies when a row should be dropped based on its null values. If the value is 'any', the row is dropped if any of its values are null. If the value is 'all', the row is dropped only if all of its values are null.

  • thresh − This parameter specifies the minimum number of non-null values required for a row to be retained. If the number of non-null values in a row is less than the thresh value, that row is dropped.

  • subset − This parameter specifies the subset of columns to consider when checking for null values. Only null values in the specified columns can cause a row to be dropped.

By using the dropna() function with the appropriate parameters, you can clean your DataFrame and remove rows with null or missing values. This is important because null or missing values can lead to inaccuracies in your analysis, and removing them improves the accuracy and reliability of your data. Additionally, dropna() works equally well on small and large datasets, making it an essential tool for any data cleaning project in PySpark.
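As a quick illustration of how these parameters can be combined, the following minimal sketch keeps only the rows that have at least three non-null values among three chosen columns. It assumes the DataFrame df and the column names created in the examples below; the specific call is only an example, not part of the tutorial's main walkthrough.

# keep only rows with at least 3 non-null values
# among the Name, Job Profile and City columns
cleaned_df = df.dropna(thresh=3, subset=["Name", "Job Profile", "City"])
cleaned_df.show()

Each parameter is demonstrated individually in the sections that follow.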

Before utilizing the dropna method to remove null values, we must first create a Pyspark DataFrame. Once the DataFrame is created, we can proceed to apply the dropna method to eliminate any null values present within the DataFrame.

The prerequisite for running the code in this tutorial is to have the pyspark module installed.

The below command will install the pyspark module.

Command

pip3 install pyspark

Consider the code shown below.

Example

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	df.show()

Explanation

This code demonstrates how to create a new SparkSession and DataFrame in PySpark. To do so, it imports the necessary library for creating a SparkSession.

Next, a function is defined called create_session() which sets up and configures a new SparkSession. This function specifies that Spark should be run locally on a single node, sets the name of the application, and either creates a new SparkSession or returns an existing one.

The create_df() function is defined next, which creates a new DataFrame using the input data and schema. This function takes in the SparkSession, input data, and schema as inputs, and returns a new DataFrame.

The input data is a list of tuples, where each tuple represents a row in the DataFrame. The schema is a list of column names, where each name corresponds to a column in the DataFrame.

Finally, the main section of the code calls the create_session() function to create a new SparkSession, defines the input data and schema for the DataFrame, and calls the create_df() function to create a new DataFrame using the input data and schema. The resulting DataFrame is then printed using the .show() method.

To run the above code we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we can expect the output to be similar to the one shown below.

Output

+----+------+------------------+-------------+
|  Id|  Name|       Job Profile|         City|
+----+------+------------------+-------------+
|   1|  John|    Data Scientist|      Seattle|
|   2|  null|Software Developer|         null|
|   3|  Emma|      Data Analyst|     New York|
|   4|  null|              null|San Francisco|
|   5|Andrew| Android Developer|  Los Angeles|
|   6| Sarah|              null|         null|
|null|  null|              null|         null|
+----+------+------------------+-------------+

Cleaning data with dropna using the how="any" parameter in PySpark

In the code below, the dropna() function is called with the parameter how="any". This parameter specifies that any row containing at least one null value will be dropped from the DataFrame.

Consider the code shown below.

Example

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# if any value in a row is null,
	# we drop that row
	df = df.dropna(how="any")
	df.show()

To run the above code we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we can expect the output to be similar to the one shown below.

Output

+---+------+-----------------+-----------+
| Id|  Name|      Job Profile|       City|
+---+------+-----------------+-----------+
|  1|  John|   Data Scientist|    Seattle|
|  3|  Emma|     Data Analyst|   New York|
|  5|Andrew|Android Developer|Los Angeles|
+---+------+-----------------+-----------+

Cleaning data with dropna using the how="all" parameter in PySpark

In the code below, the dropna() function is called with the parameter how="all". This parameter specifies that a row will be dropped from the DataFrame only if all of its values are null.

Consider the code shown below.

Example

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# if all values in a row are null,
	# we drop that row
	df = df.dropna(how="all")
	df.show()

To run the above code we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we can expect the output to be similar to the one shown below.

Output

+---+------+------------------+-------------+
| Id|  Name|       Job Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+

Cleaning data with dropna using the thresh parameter in PySpark

In the code below, the dropna() function is called with the parameter thresh=2. This parameter specifies that any row containing fewer than two non-null values will be dropped from the DataFrame.

Consider the code shown below.

Example

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# if a row has fewer than 2 non-null
	# values (the thresh value), we drop
	# that row
	df = df.dropna(thresh=2)
	df.show()

To run the above code we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we can expect the output to be similar to the one shown below.

Output

+---+------+------------------+-------------+
| Id|  Name|       Job Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+
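Notice that the row with Id 6 is kept even though it has two null values, because it still contains two non-null values. As a side note, PySpark documents that when thresh is supplied it overrides the how parameter; for this four-column DataFrame, the two calls below should therefore behave the same way (shown only as an illustrative sketch, not part of the main example):

# how="any" keeps rows with no nulls, which for a 4-column
# DataFrame is the same as requiring 4 non-null values
df.dropna(how="any").show()
df.dropna(thresh=4).show()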

Cleaning data with dropna using the subset parameter in PySpark

In the code below, we have passed the subset="City" parameter to the dropna() function, where "City" is the name of the column to check. If a null value is present in that column, the corresponding row is dropped from the DataFrame.

Consider the code shown below.

Example

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# if the subset column (City) contains
	# a null value, we drop that row
	df = df.dropna(subset="City")
	df.show()

To run the above code we need to run the command shown below.

Command

python3 main.py

Once we run the above command, we can expect the output to be similar to the one shown below.

Output

+---+------+-----------------+-------------+
| Id|  Name|      Job Profile|         City|
+---+------+-----------------+-------------+
|  1|  John|   Data Scientist|      Seattle|
|  3|  Emma|     Data Analyst|     New York|
|  4|  null|             null|San Francisco|
|  5|Andrew|Android Developer|  Los Angeles|
+---+------+-----------------+-------------+
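The subset parameter also accepts a list of column names, in which case the null check is limited to those columns. The short sketch below, using the same column names as above purely as an illustration, drops rows where either Name or City is null:

# drop rows where either the Name or the City column is null
df = df.dropna(how="any", subset=["Name", "City"])
df.show()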

Conclusion

In conclusion, cleaning data is an essential part of data preprocessing before any analysis or modeling. In Python, the dropna() function, available both in the Pandas library and in the PySpark DataFrame API, provides an easy and efficient way to remove rows (or, in Pandas, columns) that contain null values from a DataFrame.

By specifying different parameters such as how, thresh, and subset, users can control the behavior of the function and customize the cleaning process. Overall, the dropna() function is a powerful tool for data cleaning that helps to improve data quality and enhance the accuracy of any subsequent analysis or modeling.
