Full outer join in PySpark dataframe

A full outer join is an operation that combines the results of a left outer join and a right outer join. In PySpark, it joins two dataframes on a given condition and includes all records from both dataframes in the output, whether or not a match exists. This article explains how to perform a full outer join in PySpark and walks through a practical example.

Installation and Setup

Before we can perform a full outer join in PySpark, we need to set up a working environment. First, install PySpark by running "pip install pyspark" in the terminal. Then, import the necessary modules:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

Syntax

The syntax for performing a full outer join in PySpark is as follows:

df_full = df1.join(df2, (df1.column_name == df2.column_name), 'full')

Algorithm

  • First, we import the necessary modules i.e., SparkSession and col.

  • We create a SparkSession object using the builder attribute and specify the application name.

  • We create sample dataframes with different data sets to demonstrate the join operation.

  • We perform the full outer join operation using the join() method and passing the condition as a parameter.

  • We display the resultant dataframe using the show() method.

Example

Let's consider two dataframes with some overlapping and non-overlapping records to demonstrate a true full outer join. We'll create a "df_sales" dataframe containing sales information and a "df_customers" dataframe containing customer information:

from pyspark.sql import SparkSession

# Create a SparkSession object
spark = SparkSession.builder.appName("Full Outer Join").getOrCreate()

# Create sample sales dataframe
data_sales = [("101", "Product1", 500), 
              ("102", "Product2", 750), 
              ("103", "Product3", 300),
              ("104", "Product4", 1200),
              ("105", "Product5", 900)]
df_sales = spark.createDataFrame(data_sales, ["customer_id", "product", "amount"])

# Create sample customers dataframe with some overlapping IDs
data_customers = [("101", "John"), 
                  ("102", "Jane"), 
                  ("106", "Mike"), 
                  ("107", "Emily"), 
                  ("108", "Bob")]
df_customers = spark.createDataFrame(data_customers, ["customer_id", "name"])

print("Sales DataFrame:")
df_sales.show()

print("Customers DataFrame:")
df_customers.show()

# Perform the full outer join operation
df_full = df_sales.join(df_customers, df_sales.customer_id == df_customers.customer_id, 'full')

print("Full Outer Join Result:")
df_full.show()

Output

Sales DataFrame:
+-----------+--------+------+
|customer_id| product|amount|
+-----------+--------+------+
|        101|Product1|   500|
|        102|Product2|   750|
|        103|Product3|   300|
|        104|Product4|  1200|
|        105|Product5|   900|
+-----------+--------+------+

Customers DataFrame:
+-----------+-----+
|customer_id| name|
+-----------+-----+
|        101| John|
|        102| Jane|
|        106| Mike|
|        107|Emily|
|        108|  Bob|
+-----------+-----+

Full Outer Join Result:
+-----------+--------+------+-----------+-----+
|customer_id| product|amount|customer_id| name|
+-----------+--------+------+-----------+-----+
|        101|Product1|   500|        101| John|
|        102|Product2|   750|        102| Jane|
|        103|Product3|   300|       null| null|
|        104|Product4|  1200|       null| null|
|        105|Product5|   900|       null| null|
|       null|    null|  null|        106| Mike|
|       null|    null|  null|        107|Emily|
|       null|    null|  null|        108|  Bob|
+-----------+--------+------+-----------+-----+

How It Works

In this example, the full outer join returns:

  • Matching records: Customers 101 and 102 exist in both dataframes, so their complete information is shown

  • Left-only records: Customers 103, 104, and 105 have sales but no customer details (null values for name)

  • Right-only records: Customers 106, 107, and 108 have customer details but no sales (null values for product and amount)

Applications

Full outer joins are particularly useful in scenarios such as:

  • Data completeness analysis: Identifying missing relationships between datasets

  • Data integration: Combining data from multiple sources where some records may not have matches

  • Audit purposes: Finding discrepancies between related datasets

Conclusion

A full outer join is a powerful operation for combining data from two dataframes while preserving all records from both sides. In PySpark, call the join() method with the 'full' join type to include all records regardless of whether they match; null values fill the gaps where no match exists.

---
Updated on: 2026-03-27T06:21:27+05:30
