Cleaning Data with Apache Spark in Python


With the high volume and velocity of data flowing today, Apache Spark, an open-source big data processing framework, is a common choice because it allows parallel and distributed processing of data. Cleaning such data is an important step, and Apache Spark provides us with a variety of tools and methods for it. In this article, we are going to see how to clean data with Apache Spark in Python, and the steps to do so are as follows:

  • Loading the data into a Spark DataFrame − The SparkSession.read method allows us to read data from various sources like CSV, JSON, Parquet, etc. (a short loading sketch follows the installation command below).

  • Handling missing values − The DataFrame.dropna or DataFrame.fillna methods allow us to drop the rows with missing values or fill the missing values with a particular value, respectively.

  • Handling duplicates − Often data contains duplicate entries. To handle this, the DataFrame.dropDuplicates method allows us to remove duplicates from the DataFrame.

  • Handling outliers − The DataFrame.filter method allows us to remove all those rows that contain outliers.

  • Converting data types − To convert the data types of columns, we have the Column.cast method, used together with DataFrame.withColumn.

Before going forward and seeing how to clean data with PySpark, we first have to install the pyspark library. To accomplish this, run the following command in the terminal:

pip install pyspark
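
As a quick illustration of the loading step, here is a minimal sketch of reading a CSV file into a Spark DataFrame. The file name data.csv is a placeholder, and the header and inferSchema options are assumptions about how the file is laid out:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("LoadData").getOrCreate()

# Read a CSV file into a DataFrame; header and inferSchema are common options
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# JSON and Parquet sources are read in the same way:
# df = spark.read.json("data.json")
# df = spark.read.parquet("data.parquet")

df.show()
spark.stop()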

Handling Missing Values

Handling missing values in Apache Spark involves identifying and dealing with missing or incomplete data in a data set stored in a Spark DataFrame. There are several methods for handling missing values in Spark, including:

  • Dropping rows − This involves removing records that contain missing values from the DataFrame.

  • Impute missing values − This involves replacing missing values with a calculated value such as the mean, median, or mode of the data in the column.

  • Fill missing values − This involves replacing missing values with a specific value, such as zero, or a default value.

  • Interpolate missing values − This involves using mathematical methods such as linear interpolation or spline interpolation to estimate missing values.

The method used for handling missing values depends on the specific requirements and goals of the data analysis. It is important to handle missing values in a consistent and repeatable manner to ensure the integrity of the data and the accuracy of the results.

In Apache Spark, the functions provided by the pyspark.sql.DataFrame and pyspark.sql.DataFrameNaFunctions classes can be used to handle missing values. These functions include dropna, fillna, and replace; interpolation is not built in and has to be implemented manually, for example with window functions.
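
Before the fuller example below, here is a minimal sketch of the two simplest approaches, dropping rows and filling nulls with fixed defaults. It assumes a DataFrame df with Age and Salary columns like the one created in the example that follows:

# Drop every row that contains at least one null value
df_dropped = df.dropna()

# Drop rows only when the 'Salary' column is null
df_dropped_salary = df.dropna(subset=["Salary"])

# Fill nulls with fixed, per-column default values
df_filled = df.fillna({"Age": 0, "Salary": 0.0})

# A median-based fill can be built with approxQuantile
median_salary = df.approxQuantile("Salary", [0.5], 0.0)[0]
df_median_filled = df.fillna({"Salary": median_salary})

The full example below imputes the missing values with the column means instead.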

Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MissingData").getOrCreate()

# Create a sample data frame
data = [("John", 25, None), ("Jane", 30, 35.5), ("Jim", None, 40.0), ("Joan", 32, None)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Replacing Missing Values with Mean
from pyspark.sql.functions import mean
mean_age = df.agg(mean(df["Age"])).first()[0]
mean_salary = df.agg(mean(df["Salary"])).first()[0]
df = df.fillna({"Age": mean_age, "Salary": mean_salary})

# Display the cleaned data frame
print("Cleaned Data Frame:")
df.show()

spark.stop()

Output

Original Data Frame:
+----+----+------+
|Name| Age|Salary|
+----+----+------+
|John|  25|  null|
|Jane|  30|  35.5|
| Jim|null|  40.0|
|Joan|  32|  null|
+----+----+------+

Cleaned Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25| 37.75|
|Jane| 30|  35.5|
| Jim| 29|  40.0|
|Joan| 32| 37.75|
+----+---+------+

Handling Duplicates

Handling duplicates in Apache Spark involves identifying and dealing with duplicate records in a data set stored in a Spark DataFrame. There are several methods for handling duplicates in Spark, including:

  • Removing duplicates − This involves identifying and removing duplicate records from the DataFrame. The dropDuplicates function can be used to remove duplicate records in Spark.

  • Retaining duplicates − This involves retaining all instances of duplicate records in the DataFrame, usually by adding a unique identifier or index to each record.

  • Marking duplicates − This involves marking duplicate records in the DataFrame, but not removing them, so that they can be further analyzed or processed.

The method used for handling duplicates depends on the specific requirements and goals of the data analysis. It is important to handle duplicates in a consistent and repeatable manner to ensure the integrity of the data and the accuracy of the results.

In Apache Spark, the dropDuplicates function can be used to remove duplicate records from a DataFrame. The function optionally takes a list of columns as input; if columns are given, it removes all records whose values in those columns are identical, otherwise it compares entire rows. The dropDuplicates function returns a new DataFrame with the duplicate records removed.
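
As a complement to the example below, which deduplicates whole rows, here is a minimal sketch of deduplicating on a subset of columns and of marking duplicates without removing them. It assumes a DataFrame df shaped like the one in the example, and the column names dup_count and is_duplicate are arbitrary choices:

from pyspark.sql import Window
from pyspark.sql.functions import count, lit, col

# Keep only one row per (Name, Age) combination, ignoring the other columns
df_unique = df.dropDuplicates(["Name", "Age"])

# Mark duplicates instead of removing them: count identical rows per group
w = Window.partitionBy("Name", "Age", "Salary")
df_marked = (df.withColumn("dup_count", count(lit(1)).over(w))
   .withColumn("is_duplicate", col("dup_count") > 1))

Marking duplicates keeps every record available for later inspection, which is useful when the duplicates themselves need to be audited.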

Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DuplicateData").getOrCreate()

# Create a sample data frame
data = [("John", 25, 90.0), ("Jane", 30, 35.5), ("Jim", 20, 200.0), ("Joan", 32, 50.0),
   ("John", 25, 90.0), ("Jim", 20, 200.0)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Remove duplicates
df = df.dropDuplicates()

# Display the cleaned data frame
print("Cleaned Data Frame:")
df.show()

spark.stop()

Output

Original Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
|John| 25|  90.0|
| Jim| 20| 200.0|
+----+---+------+

Cleaned Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|Jane| 30|  35.5|
|John| 25|  90.0|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+

Handling Outliers

Handling outliers in Apache Spark refers to the process of identifying and either removing or transforming values in a dataset that are considered extreme or outside of the normal range of values. Outliers can have a significant impact on the results of statistical analyses, so it is often necessary to handle them in some way.

There are several common approaches to handling outliers in Apache Spark, including:

  • Removing records containing outliers − This involves filtering out records where the value of a particular column is outside of a specified range or falls outside of a certain number of standard deviations from the mean.

  • Replacing outliers with the mean or median − This involves replacing values that are considered outliers with the mean or median of the remaining values in the column.

  • Winsorizing outliers − This involves replacing values that are outliers with a specified percentile value, such as the 5th or 95th percentile.

  • Clipping outliers − This involves replacing values that are outliers with a specified maximum or minimum value.

To handle outliers in Apache Spark, you can use the built-in functions available in the pyspark.sql.functions module to calculate statistics such as the mean and standard deviation, and then use the filter or withColumn methods to remove or replace outliers as desired.
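
The example below removes outliers by filtering. As an alternative, here is a minimal sketch of clipping outliers instead, using when/otherwise to cap the Salary column; the lower and upper bounds are assumed values, not ones derived from the data:

from pyspark.sql.functions import when

# Cap 'Salary' to the range [lower, upper] instead of dropping the rows
lower, upper = 30.0, 150.0
df_clipped = df.withColumn(
   "Salary",
   when(df["Salary"] > upper, upper)
   .when(df["Salary"] < lower, lower)
   .otherwise(df["Salary"])
)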

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, abs

spark = SparkSession.builder.appName("OutlierHandlingExample").getOrCreate()

# Create a sample data frame
data = [("John", 25, 90.0), ("Jane", 30, 35.5), ("Jim", 20, 200.0), ("Joan", 32, 50.0)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Calculate mean and standard deviation
mean_salary = df.agg(mean(df["Salary"])).first()[0]
stddev_salary = df.agg(stddev(df["Salary"])).first()[0]

# Identify and filter out outliers
df = df.filter(abs(df["Salary"] - mean_salary) < stddev_salary)

# Display the cleaned data frame
print("Cleaned Data Frame:")
df.show()

spark.stop()

Output

Original Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+

Cleaned Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
|Joan| 32|  50.0|
+----+---+------+

Converting Data Types

Converting data types refers to the process of changing the representation of data from one data type to another. In data processing and analysis, it is common to encounter data in different formats that are not suitable for the desired analysis. In such cases, converting the data type to a suitable format is necessary to perform the analysis correctly.

For example, in a data frame, a column may have the data type of string, but the values in that column are numbers. In this case, the data type of the column needs to be changed to integer or float, depending on the requirements of the analysis. Similarly, a column may have the data type of string, but the values in that column are dates. In this case, the data type of the column needs to be changed to a date type.

Converting data types is an important step in data cleaning and preprocessing, as it ensures that the data is in the correct format for the analysis.
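
To illustrate the date case, here is a minimal sketch of parsing a string column of dates into a proper date type. The column name JoinDate and the yyyy-MM-dd format are hypothetical:

from pyspark.sql.functions import to_date

# Parse the 'JoinDate' string column into a DateType column
df = df.withColumn("JoinDate", to_date(df["JoinDate"], "yyyy-MM-dd"))

The example below covers the numeric case, casting string columns to integer and float with Column.cast.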

Example

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType

spark = SparkSession.builder.appName("DataTypeConversion").getOrCreate()

# Create a sample data frame
data = [("John", "25", "90"), ("Jane", "30", "35"), ("Jim", "20", "200"), ("Joan", "32", "50")]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Convert the data type of the 'Age' column to integer
df = df.withColumn("Age", df["Age"].cast(IntegerType()))

# Convert the data type of the 'Salary' column to float
df = df.withColumn("Salary", df["Salary"].cast(FloatType()))

# Display the converted data frame
print("Converted Data Frame:")
df.show()

spark.stop()

Output

Original Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|    90|
|Jane| 30|    35|
| Jim| 20|   200|
|Joan| 32|    50|
+----+---+------+

Converted Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.0|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+

Conclusion

Cleaning data in Apache Spark is an essential part of the data preparation process. Apache Spark provides a powerful and efficient platform for processing large datasets while performing data cleaning tasks such as handling missing values, duplicates, and outliers. The pyspark.sql.functions module provides a large number of functions which, combined with the ability to perform complex operations in a distributed environment, make Apache Spark an excellent choice for data cleaning and preparation. By using the right tools and techniques, we can prepare our data for analysis, machine learning, or any other data-driven application with a higher chance of accurate results.
