How to create an empty PySpark dataframe?


PySpark is the Python API for Apache Spark, a framework widely used for large-scale data processing tasks. It provides an efficient way to work with big data distributed across a cluster.

A PySpark DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database, with columns representing the features and rows representing the observations. A DataFrame can be created from various data sources, such as CSV, JSON, and Parquet files, as well as existing RDDs (Resilient Distributed Datasets). However, sometimes it may be necessary to create an empty DataFrame, for example to initialize a schema or to serve as a placeholder for future data. In this tutorial, we illustrate two examples.

Syntax

To create an empty PySpark DataFrame, we use the following syntax −

empty_df = spark.createDataFrame([], schema)

In this syntax, we pass an empty list of rows and the schema to the ‘createDataFrame()’ method, which returns an empty DataFrame.
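
The ‘createDataFrame()’ method also accepts the schema as a DDL-formatted string, which can be more concise than building a ‘StructType’ by hand. A minimal sketch, assuming an existing SparkSession named ‘spark’ and an illustrative two-column schema −

# Creating an empty DataFrame from a DDL-formatted schema string
# (the column names and types here are made-up placeholders)
empty_df = spark.createDataFrame([], "name STRING, age INT")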

Example

In this example, we create an empty DataFrame with a single column.

# Importing the necessary modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Creating a SparkSession object
spark = SparkSession.builder.appName("EmptyDataFrame").getOrCreate()

# Defining the schema of the DataFrame
schema = StructType([StructField("age", IntegerType(), True)])

# Creating an empty DataFrame
empty_df = spark.createDataFrame([], schema)

# Printing the output
empty_df.show()

In this example, we first defined a schema with a single column named "age" of ‘IntegerType’ and then created an empty DataFrame with that schema. Finally, we displayed the empty DataFrame using the ‘show()’ method.

Output

+---+
|age|
+---+
+---+
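
We can also confirm programmatically that the DataFrame is empty. A minimal check, continuing from the code above; note that ‘isEmpty()’ is only available on Spark 3.3 and later −

# The schema is preserved even though there are no rows
empty_df.printSchema()

# count() returns 0 for an empty DataFrame
print(empty_df.count())

# On Spark 3.3 and later, isEmpty() offers a direct check
print(empty_df.isEmpty())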

Example

In this example, we create an empty DataFrame with multiple columns.

# Importing the necessary modules
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession

# Creating a SparkSession object
spark = SparkSession.builder.appName("EmptyDataFrame").getOrCreate()

# Defining the schema of the DataFrame
schema = StructType([
   StructField("col_1", StringType(), True),
   StructField("col_2", StringType(), True),
   StructField("col_3", StringType(), True),
   StructField("col_4", StringType(), True),
   StructField("col_5", StringType(), True),
   StructField("col_6", StringType(), True),
   StructField("col_7", StringType(), True),
   StructField("col_8", StringType(), True),
   StructField("col_9", StringType(), True),
   StructField("col_10", IntegerType(), True)
])

# Creating an empty DataFrame
empty_df = spark.createDataFrame([], schema)

# Printing the output
empty_df.show(10000)

In this example, we first defined a schema with ten columns named "col_1" to "col_10", the first nine of ‘StringType’ and the last of ‘IntegerType’, and then created an empty DataFrame with that schema. Finally, we displayed the empty DataFrame using the ‘show()’ method, requesting up to 10,000 rows to demonstrate that the DataFrame is indeed empty.

Note that even though we asked ‘show()’ to display up to 10,000 rows, nothing appears below the header because the DataFrame contains no rows in any of its columns.

Output

+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
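
Since an empty DataFrame often serves as a placeholder, rows can be added later by building a new DataFrame with the same schema and appending it with ‘union()’. A brief sketch, continuing from the code above with made-up values −

# Building a one-row DataFrame with the same schema (the values are
# made-up placeholders) and appending it with union()
row = tuple(f"value_{i}" for i in range(1, 10)) + (10,)
new_rows = spark.createDataFrame([row], schema)
populated_df = empty_df.union(new_rows)
populated_df.show()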

In this tutorial, we learned to create an empty PySpark DataFrame using the ‘createDataFrame()’ method. We illustrated two examples: creating an empty DataFrame with a single column and creating one with multiple columns. In both cases, we first defined a schema using ‘StructType()’ and ‘StructField()’ and then passed it to the ‘createDataFrame()’ method along with an empty list ‘[]’, which returns an empty DataFrame with the specified schema. By creating an empty PySpark DataFrame, we can set up its structure in advance and populate it with data as needed. This is useful when dealing with large datasets where the structure is known in advance but the data itself is not yet available.
