Full outer join in PySpark dataframe
A Full Outer Join is an operation that combines the results of a left outer join and a right outer join. In PySpark, it is used to join two dataframes on a given condition while keeping all the records from both dataframes in the output, regardless of whether a match is found; where no match exists, the missing side is filled with nulls. This article explains how to perform a full outer join in PySpark and provides a practical example to illustrate its implementation.
Installation and Setup
Before we can perform a full outer join in PySpark, we need to set up a working environment. Firstly, we need to install PySpark by running the command "pip install pyspark" in the terminal. Secondly, we need to import the necessary modules by running the following commands −
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
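If you want to confirm that the installation and imports are working before moving on, a quick check like the following can be run. The app name "SetupCheck" and the local master URL are arbitrary choices for this sketch, assuming a standard pip installation −

import pyspark
from pyspark.sql import SparkSession

# Print the installed PySpark version
print(pyspark.__version__)

# Start and immediately stop a local session to confirm Spark can launch
spark = SparkSession.builder.appName("SetupCheck").master("local[*]").getOrCreate()
print(spark.version)
spark.stop()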
Syntax
The syntax for performing a full outer join in PySpark is as follows −
df_full = df1.join(df2, (df1.column_name == df2.column_name), 'full')
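A couple of notes on this syntax, based on PySpark's join() API rather than anything specific to this example: the strings 'full', 'outer', and 'full_outer' are accepted as synonyms for the join type, and if the join column has the same name in both dataframes you can pass the column name as a string, which keeps a single copy of that column in the result −

# Equivalent spellings of the full outer join type
df_full = df1.join(df2, df1.column_name == df2.column_name, 'full')
df_full = df1.join(df2, df1.column_name == df2.column_name, 'full_outer')

# String form of the join condition when both dataframes share the column name;
# the result contains column_name only once
df_full = df1.join(df2, on='column_name', how='full')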
Algorithm
First, we import the necessary modules, i.e., SparkSession and col.
We create a SparkSession object through the builder attribute, specifying the application name (and, if needed, the master URL) before calling getOrCreate().
We prepare the input dataframes. In a real workload these would typically be read from files with the read.csv() method; for this example we build small dummy dataframes with createDataFrame().
We perform the full outer join operation using the join() method and passing the condition as a parameter.
We display the resultant dataframe using the show() method.
Example
Let's consider two dataframes, "df_sales" and "df_customers". The "df_sales" dataframe contains information about the sales made by a company, while "df_customers" contains information about the customers who made the purchases. We want to join these two dataframes on the "customer_id" column and get all the records from both dataframes.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession object
spark = SparkSession.builder.appName("Full Outer Join").getOrCreate()

# Create sample dataframes
data_sales = [("S1", "Product1", 100, "C1"), ("S2", "Product2", 200, "C2"),
   ("S3", "Product3", 300, "C3"), ("S4", "Product4", 400, "C4"),
   ("S5", "Product5", 500, "C5"), ("S6", "Product6", 600, "C6"),
   ("S7", "Product7", 700, "C7"), ("S8", "Product8", 800, "C8"),
   ("S9", "Product9", 900, "C9"), ("S10", "Product10", 1000, "C10")]
df_sales = spark.createDataFrame(data_sales, ["sale_id", "product", "amount", "customer_id"])

data_customers = [("C1", "John"), ("C2", "Jane"), ("C3", "Mike"), ("C4", "Emily"),
   ("C5", "Bob"), ("C6", "Alice"), ("C7", "Dave"), ("C8", "Jenny"),
   ("C9", "Peter"), ("C10", "Sarah")]
df_customers = spark.createDataFrame(data_customers, ["customer_id", "name"])

# Perform the full outer join on the customer_id column and drop the
# duplicate join column coming from df_customers
df_full = df_sales.join(
   df_customers, df_sales.customer_id == df_customers.customer_id, 'full'
).drop(df_customers.customer_id)

# Display the resultant dataframe
df_full.show()
Output
+-------+---------+------+-----------+-----+
|sale_id|  product|amount|customer_id| name|
+-------+---------+------+-----------+-----+
|     S1| Product1|   100|         C1| John|
|     S2| Product2|   200|         C2| Jane|
|     S3| Product3|   300|         C3| Mike|
|     S4| Product4|   400|         C4|Emily|
|     S5| Product5|   500|         C5|  Bob|
|     S6| Product6|   600|         C6|Alice|
|     S7| Product7|   700|         C7| Dave|
|     S8| Product8|   800|         C8|Jenny|
|     S9| Product9|   900|         C9|Peter|
|    S10|Product10|  1000|        C10|Sarah|
+-------+---------+------+-----------+-----+
This code creates the two dataframes df_sales and df_customers, each with ten rows of sample data. The df_sales dataframe has the columns sale_id, product, amount, and customer_id, while df_customers has the columns customer_id and name. The join() method is then called with the 'full' join type to perform a full outer join between the two dataframes, with the condition that the customer_id column in df_sales matches the customer_id column in df_customers; the duplicate join column from df_customers is dropped afterwards so that customer_id appears only once in the result.
The script then uses the show() method to display the resultant dataframe df_full, which contains the columns from both dataframes. If a record in one dataframe has no matching record in the other, the missing values on that side are filled with nulls.
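Because every sale in the sample data has a matching customer, the result above contains no nulls. A minimal sketch of the null-filling behaviour, reusing the SparkSession created above, could look like the following; the rows S11 and C12 are made-up records with no match on the other side, and row order as well as the rendering of nulls in show() can vary between Spark versions −

# One sale without a customer and one customer without a sale
extra_sales = spark.createDataFrame(
   [("S11", "Product11", 1100, "C11")],
   ["sale_id", "product", "amount", "customer_id"])
extra_customers = spark.createDataFrame(
   [("C12", "Tom")],
   ["customer_id", "name"])

# Full outer join on the shared column name keeps a single customer_id column
demo = extra_sales.join(extra_customers, on="customer_id", how="full")
demo.show()

# +-----------+-------+---------+------+----+
# |customer_id|sale_id|  product|amount|name|
# +-----------+-------+---------+------+----+
# |        C11|    S11|Product11|  1100|null|
# |        C12|   null|     null|  null| Tom|
# +-----------+-------+---------+------+----+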
Applications
A full outer join is a useful operation when working with large datasets that may contain missing data or unmatched records. It can be applied in a wide range of situations, including data cleansing, combining data from multiple sources, and comparing or reconciling data across different systems.
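As an illustration of the data cleansing case, one common follow-up step is filling the nulls that a full outer join can produce. The following is only a sketch; the column names reuse the example above and the replacement values ("unknown customer", 0) are arbitrary placeholders −

from pyspark.sql import functions as F

# Replace a missing customer name and a missing sale amount with defaults
cleaned = (
   df_full
   .withColumn("name", F.coalesce(F.col("name"), F.lit("unknown customer")))
   .fillna({"amount": 0})
)
cleaned.show()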
Conclusion
A full outer join is a powerful operation for combining data from two dataframes based on a specified condition. In PySpark, we can perform it by passing the join condition and the 'full' join type as parameters to the join() method. By following the steps described in this article, you can easily perform a full outer join in PySpark and take advantage of it in your data analysis and processing tasks.