PySpark Articles


Full outer join in PySpark dataframe

Atharva Shah
Updated on 27-Mar-2026 2K+ Views

A full outer join combines the results of a left outer join and a right outer join. In PySpark, it joins two DataFrames on a given condition so that all records from both DataFrames appear in the output, whether or not there is a match. This article explains how to perform a full outer join in PySpark and provides a practical example to illustrate its implementation.

Installation and Setup

Before we can perform a full outer join in PySpark, we need to ...

Read More

Drop rows containing specific value in pyspark dataframe

Devesh Chauhan
Updated on 27-Mar-2026 1K+ Views

When dealing with large datasets, PySpark provides powerful tools for data processing and manipulation. PySpark is Apache Spark's Python API that allows you to work with distributed data processing in your local Python environment. In this tutorial, we'll learn how to drop rows containing specific values from a PySpark DataFrame using different methods. This selective data elimination is essential for data cleaning and maintaining data relevance.

Creating a Sample PySpark DataFrame

First, let's create a sample DataFrame to demonstrate the row-dropping techniques −

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("DropRowsDemo").getOrCreate() ...

Read More

Drop One or Multiple Columns From PySpark DataFrame

Devesh Chauhan
Updated on 27-Mar-2026 1K+ Views

A PySpark DataFrame is a distributed data structure built on Apache Spark that provides powerful data processing capabilities. Sometimes you need to remove unnecessary columns to optimize performance or focus on specific data. PySpark offers several methods to drop one or multiple columns from a DataFrame.

Creating a PySpark DataFrame

First, let's create a sample DataFrame to demonstrate column-dropping operations −

from pyspark.sql import SparkSession
import pandas as pd

# Create SparkSession
spark = SparkSession.builder.appName("DropColumns").getOrCreate()

# Sample dataset
dataset = {
    "Device name": ["Laptop", "Mobile phone", "TV", "Radio"], ...

Read More

Drop duplicate rows in PySpark DataFrame

Devesh Chauhan
Updated on 27-Mar-2026 586 Views

PySpark is a Python API for Apache Spark, designed to process large-scale data in real time with distributed computing capabilities. Unlike regular DataFrames, PySpark DataFrames distribute data across clusters and follow a strict schema for optimized processing. In this article, we'll explore different methods to drop duplicate rows from PySpark DataFrames using the distinct() and dropDuplicates() functions.

Installation

Install PySpark using pip −

pip install pyspark

Creating a PySpark DataFrame

First, let's create a sample DataFrame with duplicate rows to demonstrate the deduplication methods −

from pyspark.sql import SparkSession
import pandas as ...

Read More

Creating a PySpark DataFrame

Tamoghna Das
Updated on 27-Mar-2026 2K+ Views

PySpark is a powerful Python API for Apache Spark that enables distributed data processing. The DataFrame is a fundamental data structure in PySpark, providing a structured way to work with large datasets across multiple machines.

What is PySpark and Its Key Advantages?

PySpark combines Python's simplicity with Apache Spark's distributed computing capabilities. Key advantages include −

Scalability − Handle large datasets and scale up or down based on processing needs
Speed − Fast data processing through in-memory computation and parallel execution
Fault tolerance − Automatic recovery from hardware or software failures
Flexibility − Support for batch ...

Read More

How to check if something is a RDD or a DataFrame in PySpark?

Niharika Aitam
Updated on 20-Oct-2023 1K+ Views

RDD is an abbreviation for Resilient Distributed Dataset, PySpark's fundamental abstraction (an immutable collection of objects). RDDs are the primary building blocks of PySpark. They are split into smaller chunks and distributed among the nodes in a cluster, and they support transformation and action operations.

DataFrame in PySpark

A DataFrame is a two-dimensional labeled data structure in Python, used for data manipulation and data analysis. It accepts different datatypes such as integers, floats, strings, etc. The column labels are unique, while the rows are labeled with a unique index value that facilitates accessing specific rows. ...

Read More

How to create an empty PySpark dataframe?

Manthan Ghasadiya
Updated on 10-Apr-2023 15K+ Views

PySpark is a data processing framework built on top of Apache Spark, widely used for large-scale data processing tasks. It provides an efficient way to work with big data. A PySpark DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database, with columns representing the features and rows representing the observations. A DataFrame can be created from various data sources, such as CSV, JSON, and Parquet files, and from existing RDDs (Resilient Distributed Datasets). However, sometimes it may be required to create an ...

Read More