How to Create a PySpark DataFrame from Multiple Lists?


PySpark is a powerful tool for processing large datasets in a distributed computing environment. One of the fundamental tasks in data analysis is to convert data into a format that can be easily processed and analysed. In PySpark, data is typically stored in a DataFrame, which is a distributed collection of data organised into named columns.

In some cases, we may want to create a PySpark DataFrame from multiple lists. This can be useful when we have data in a format that is not easily loaded from a file or database. For example, we may have data stored in Python lists or NumPy arrays that we want to convert to a PySpark DataFrame for further analysis.

In this article, we will explore how to create a PySpark DataFrame from multiple lists. We will discuss different approaches and provide code examples with comments and outputs for each approach.

Convert lists to a NumPy array and then to a PySpark DataFrame

One approach to create a PySpark DataFrame from multiple lists is to first convert the lists to a NumPy array and then create a PySpark DataFrame from that array using the createDataFrame() function. This approach requires the pyspark.sql.types module to specify the schema of the DataFrame, as well as an active SparkSession, which serves as the entry point for creating DataFrames.

Consider the code shown below.

Example

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Initialize SparkSession (the entry point for creating DataFrames)
spark = SparkSession.builder.appName("Create DataFrame from Lists").getOrCreate()

# Define the lists
age = [20, 25, 30, 35, 40]
salary = [25000, 35000, 45000, 55000, 65000]

# Convert the lists to a NumPy array
data = np.array([age, salary]).T

# Define the schema
schema = StructType([
	StructField("age", IntegerType(), True),
	StructField("salary", IntegerType(), True)
])

# Create the PySpark DataFrame
df = spark.createDataFrame(data.tolist(), schema=schema)

# Show the DataFrame
df.show()

Explanation

  • First, we import the required modules: numpy, SparkSession from pyspark.sql, and the schema classes from pyspark.sql.types. We then initialize a SparkSession, which is the entry point for creating DataFrames.

  • Next, we define the two lists: age and salary.

  • We then convert the lists to a NumPy array using the np.array() function and transpose the array with .T, so that each row holds one (age, salary) pair.

  • After that, we define the schema of the DataFrame using the StructType() and StructField() functions. In this case, we define two columns, age and salary, both with the IntegerType() data type.

  • Finally, we create the PySpark DataFrame by passing the array converted to a list (via .tolist(), which turns NumPy scalars into plain Python ints that Spark accepts) and the schema to the createDataFrame() function. We then display the DataFrame using the show() function.

Output

+---+------+
|age|salary|
+---+------+
| 20| 25000|
| 25| 35000|
| 30| 45000|
| 35| 55000|
| 40| 65000|
+---+------+
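
As a related shortcut (not shown in the original examples), if pandas is installed you can build a pandas DataFrame from the lists and pass it directly to createDataFrame(), which accepts pandas DataFrames; Spark then infers the column names and types rather than using an explicit schema. A minimal sketch, assuming pandas is available:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Create DataFrame from Lists").getOrCreate()

age = [20, 25, 30, 35, 40]
salary = [25000, 35000, 45000, 55000, 65000]

# Build a pandas DataFrame from the lists; Spark infers the schema from it
pdf = pd.DataFrame({"age": age, "salary": salary})
df = spark.createDataFrame(pdf)

df.show()

Note that with this route the integer columns are inferred (typically as long), whereas the explicit schema above lets you pin down the exact types.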

Combine the lists into a list of tuples and use createDataFrame()

In this approach, we create a PySpark DataFrame directly from the lists using the createDataFrame() method. We first combine the lists into a list of tuples, where each tuple represents a row of the DataFrame. We then define a schema that specifies the structure of the DataFrame, i.e., the column names and data types. Finally, we create the DataFrame by passing the list of tuples and the schema to the createDataFrame() method.

Consider the code shown below.

Example

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Create DataFrame from Lists").getOrCreate()

# Define the data as lists
names = ["Alice", "Bob", "Charlie", "David"]
ages = [25, 30, 35, 40]
genders = ["Female", "Male", "Male", "Male"]

# Define the schema of the dataframe
schema = StructType([
	StructField("Name", StringType(), True),
	StructField("Age", IntegerType(), True),
	StructField("Gender", StringType(), True)
])

# Create a list of tuples
data = [(names[i], ages[i], genders[i]) for i in range(len(names))]

# Create a PySpark dataframe
df = spark.createDataFrame(data, schema)

# Show the dataframe
df.show()

Explanation

  • First, we import the required modules: pyspark.sql.types for the schema classes and pyspark.sql for SparkSession.

  • Next, we initialize a SparkSession, which is the entry point for creating DataFrames.

  • We then define three lists: names, ages, and genders.

  • After that, we define the schema of the DataFrame using the StructType() and StructField() functions. Name and Gender are StringType() columns, while Age is an IntegerType() column.

  • We combine the lists into a list of tuples using a list comprehension, so that each tuple represents one row of the DataFrame.

  • Finally, we create the PySpark DataFrame by passing the list of tuples and the schema to the createDataFrame() method. We then display the DataFrame using the show() function.

Output

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 25|Female|
|    Bob| 30|  Male|
|Charlie| 35|  Male|
|  David| 40|  Male|
+-------+---+------+
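
As a side note, the index-based list comprehension above can be replaced with Python's built-in zip(), which pairs the lists element by element. A minimal sketch of that variant, reusing the spark session and schema already defined in the example:

# zip() pairs the lists positionally; each resulting tuple becomes one row
data = list(zip(names, ages, genders))
df = spark.createDataFrame(data, schema)
df.show()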

Conclusion

In this article, we explored two different approaches to create a PySpark DataFrame from multiple lists. The first approach converted the lists to a NumPy array, defined a schema with the StructType() and StructField() functions, and built the DataFrame with the createDataFrame() method. The second approach combined the lists into a list of tuples and passed the tuples together with a schema to createDataFrame().
