Full outer join in PySpark dataframe
A Full Outer Join is an operation that combines the results of a left outer join and a right outer join. In PySpark, it is used to join two dataframes on a given condition while keeping all the records from both dataframes in the output, regardless of whether a match is found; where no match exists, the missing side is filled with nulls. This article explains how to perform a full outer join in PySpark and provides a practical example to illustrate its implementation.
Installation and Setup
Before we can perform a full outer join in PySpark, we need to set up a working environment. Firstly, we need to install PySpark by running the command "pip install pyspark" in the terminal. Secondly, we need to import the necessary modules by running the following commands −
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
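If you want to confirm that the installation and imports are working before moving on, a quick check like the following can be run. The app name "SetupCheck" and the local master URL are arbitrary choices for this sketch, assuming a standard pip installation −

import pyspark
from pyspark.sql import SparkSession

# Print the installed PySpark version
print(pyspark.__version__)

# Start and immediately stop a local session to confirm Spark can launch
spark = SparkSession.builder.appName("SetupCheck").master("local[*]").getOrCreate()
print(spark.version)
spark.stop()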
Syntax
The syntax for performing a full outer join in PySpark is as follows −
df_full = df1.join(df2, (df1.column_name == df2.column_name), 'full')
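A couple of notes on this syntax, based on PySpark's join() API rather than anything specific to this example: the strings 'full', 'outer', and 'full_outer' are accepted as synonyms for the join type, and if the join column has the same name in both dataframes you can pass the column name as a string, which keeps a single copy of that column in the result −

# Equivalent spellings of the full outer join type
df_full = df1.join(df2, df1.column_name == df2.column_name, 'full')
df_full = df1.join(df2, df1.column_name == df2.column_name, 'full_outer')

# String form of the join condition when both dataframes share the column name;
# the result contains column_name only once
df_full = df1.join(df2, on='column_name', how='full')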
Algorithm
First, we import the necessary modules, i.e., SparkSession and col.
We create a SparkSession object through the builder attribute, specifying the application name (and, if needed, the master URL) before calling getOrCreate().
We prepare the input dataframes. In a real workload these would typically be read from files with the read.csv() method; for this example we build small dummy dataframes with createDataFrame().
We perform the full outer join operation using the join() method and passing the condition as a parameter.
We display the resultant dataframe using the show() method.
Example
Let's consider two dataframes, "df_sales" and "df_customers". The "df_sales" dataframe contains information about the sales made by a company, while "df_customers" contains information about the customers who made the purchases. We want to join these two dataframes on the "customer_id" column and get all the records from both dataframes.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession object
spark = SparkSession.builder.appName("Full Outer Join").getOrCreate()

# Create sample dataframes
data_sales = [("S1", "Product1", 100, "C1"), ("S2", "Product2", 200, "C2"),
   ("S3", "Product3", 300, "C3"), ("S4", "Product4", 400, "C4"),
   ("S5", "Product5", 500, "C5"), ("S6", "Product6", 600, "C6"),
   ("S7", "Product7", 700, "C7"), ("S8", "Product8", 800, "C8"),
   ("S9", "Product9", 900, "C9"), ("S10", "Product10", 1000, "C10")]
df_sales = spark.createDataFrame(data_sales, ["sale_id", "product", "amount", "customer_id"])

data_customers = [("C1", "John"), ("C2", "Jane"), ("C3", "Mike"), ("C4", "Emily"),
   ("C5", "Bob"), ("C6", "Alice"), ("C7", "Dave"), ("C8", "Jenny"),
   ("C9", "Peter"), ("C10", "Sarah")]
df_customers = spark.createDataFrame(data_customers, ["customer_id", "name"])

# Perform the full outer join on the customer_id column and drop the
# duplicate join column coming from df_customers
df_full = df_sales.join(
   df_customers, df_sales.customer_id == df_customers.customer_id, 'full'
).drop(df_customers.customer_id)

# Display the resultant dataframe
df_full.show()
Output
+-------+---------+------+-----------+-----+
|sale_id|  product|amount|customer_id| name|
+-------+---------+------+-----------+-----+
|     S1| Product1|   100|         C1| John|
|     S2| Product2|   200|         C2| Jane|
|     S3| Product3|   300|         C3| Mike|
|     S4| Product4|   400|         C4|Emily|
|     S5| Product5|   500|         C5|  Bob|
|     S6| Product6|   600|         C6|Alice|
|     S7| Product7|   700|         C7| Dave|
|     S8| Product8|   800|         C8|Jenny|
|     S9| Product9|   900|         C9|Peter|
|    S10|Product10|  1000|        C10|Sarah|
+-------+---------+------+-----------+-----+
This code creates the two dataframes df_sales and df_customers, each with ten rows of sample data. The df_sales dataframe has the columns sale_id, product, amount, and customer_id, while df_customers has the columns customer_id and name. The join() method is then called with the 'full' join type to perform a full outer join between the two dataframes, with the condition that the customer_id column in df_sales matches the customer_id column in df_customers; the duplicate join column from df_customers is dropped afterwards so that customer_id appears only once in the result.
The script then uses the show() method to display the resultant dataframe df_full, which contains the columns from both dataframes. If a record in one dataframe has no matching record in the other, the missing values on that side are filled with nulls.
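Because every sale in the sample data has a matching customer, the result above contains no nulls. A minimal sketch of the null-filling behaviour, reusing the SparkSession created above, could look like the following; the rows S11 and C12 are made-up records with no match on the other side, and row order as well as the rendering of nulls in show() can vary between Spark versions −

# One sale without a customer and one customer without a sale
extra_sales = spark.createDataFrame(
   [("S11", "Product11", 1100, "C11")],
   ["sale_id", "product", "amount", "customer_id"])
extra_customers = spark.createDataFrame(
   [("C12", "Tom")],
   ["customer_id", "name"])

# Full outer join on the shared column name keeps a single customer_id column
demo = extra_sales.join(extra_customers, on="customer_id", how="full")
demo.show()

# +-----------+-------+---------+------+----+
# |customer_id|sale_id|  product|amount|name|
# +-----------+-------+---------+------+----+
# |        C11|    S11|Product11|  1100|null|
# |        C12|   null|     null|  null| Tom|
# +-----------+-------+---------+------+----+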
Applications
A full outer join is a useful operation when working with large datasets that may contain missing data or unmatched records. It can be applied in a wide range of situations, including data cleansing, combining data from multiple sources, and comparing or reconciling data across different systems.
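As an illustration of the data cleansing case, one common follow-up step is filling the nulls that a full outer join can produce. The following is only a sketch; the column names reuse the example above and the replacement values ("unknown customer", 0) are arbitrary placeholders −

from pyspark.sql import functions as F

# Replace a missing customer name and a missing sale amount with defaults
cleaned = (
   df_full
   .withColumn("name", F.coalesce(F.col("name"), F.lit("unknown customer")))
   .fillna({"amount": 0})
)
cleaned.show()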
Conclusion
A full outer join is a powerful operation for combining data from two dataframes based on a specified condition. In PySpark, we can perform it by passing the join condition and the 'full' join type as parameters to the join() method. By following the steps described in this article, you can easily perform a full outer join in PySpark and take advantage of it in your data analysis and processing tasks.