How to Create a PySpark DataFrame from Multiple Lists?
PySpark is a powerful tool for processing large datasets in a distributed computing environment. One of the fundamental tasks in data analysis is to convert data into a format that can be easily processed and analysed. In PySpark, data is typically stored in a DataFrame, which is a distributed collection of data organised into named columns.
In some cases, we may want to create a PySpark DataFrame from multiple lists. This can be useful when we have data in a format that is not easily loaded from a file or database. For example, we may have data stored in Python lists or NumPy arrays that we want to convert to a PySpark DataFrame for further analysis.
In this article, we will explore how to create a PySpark DataFrame from multiple lists using different approaches with practical examples.
Method 1: Using zip() with List of Tuples
The simplest approach is to combine multiple lists using zip() and create a DataFrame directly:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize SparkSession
spark = SparkSession.builder.appName("Create DataFrame from Lists").getOrCreate()
# Define the data as lists
names = ["Alice", "Bob", "Charlie", "David"]
ages = [25, 30, 35, 40]
genders = ["Female", "Male", "Male", "Male"]
# Combine lists using zip and convert to list of tuples
data = list(zip(names, ages, genders))
# Define the schema
schema = StructType([
StructField("Name", StringType(), True),
StructField("Age", IntegerType(), True),
StructField("Gender", StringType(), True)
])
# Create PySpark DataFrame
df = spark.createDataFrame(data, schema)
# Show the DataFrame
df.show()
Output
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 25|Female|
|    Bob| 30|  Male|
|Charlie| 35|  Male|
|  David| 40|  Male|
+-------+---+------+
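One caveat with this approach is worth knowing before passing the tuples to createDataFrame: zip() silently truncates to the length of the shortest input list, so mismatched lists drop rows instead of raising an error. A minimal pure-Python sketch of the behaviour:

```python
# zip() stops at the shortest list, silently discarding extra items
names = ["Alice", "Bob", "Charlie"]
ages = [25, 30]  # one value missing

rows = list(zip(names, ages))
print(rows)  # [('Alice', 25), ('Bob', 30)] -- "Charlie" is lost

# On Python 3.10+, strict=True raises ValueError on mismatched lengths:
# list(zip(names, ages, strict=True))  # ValueError: zip() argument 2 is shorter
```

Checking that all the lists have equal length (or using strict=True) before building the DataFrame avoids silently losing data.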
Method 2: Using NumPy Array
Another approach is to convert the lists to a NumPy array and then create a PySpark DataFrame. This method is useful for numerical data:
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
# Initialize SparkSession
spark = SparkSession.builder.appName("NumPy to DataFrame").getOrCreate()
# Define the lists
age = [20, 25, 30, 35, 40]
salary = [25000, 35000, 45000, 55000, 65000]
# Convert the lists to a NumPy array and transpose
data = np.array([age, salary]).T
# Define the schema
schema = StructType([
StructField("age", IntegerType(), True),
StructField("salary", IntegerType(), True)
])
# Create the PySpark DataFrame
df = spark.createDataFrame(data.tolist(), schema=schema)
# Show the DataFrame
df.show()
Output
+---+------+
|age|salary|
+---+------+
| 20| 25000|
| 25| 35000|
| 30| 45000|
| 35| 55000|
| 40| 65000|
+---+------+
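Note that np.array() forces all values into a single dtype, so mixing integers and floats silently upcasts the integers, which can then clash with an IntegerType field in the schema. A small sketch of the behaviour (NumPy only, no Spark session needed):

```python
import numpy as np

age = [20, 25, 30]
salary = [25000.5, 35000.0, 45000.0]  # floats this time

# The array takes a single common dtype: every value becomes float64
data = np.array([age, salary]).T
print(data.dtype)  # float64 -- the ages are no longer integers

rows = data.tolist()
print(rows[0])  # [20.0, 25000.5]
```

With data like this, the schema would need DoubleType for both columns (or the columns should be kept in separate arrays) to avoid a type mismatch when creating the DataFrame.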
Method 3: Using List Comprehension
You can also create a DataFrame by building the list of tuples with a list comprehension, which is useful for more complex data transformations:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
# Initialize SparkSession
spark = SparkSession.builder.appName("List Comprehension Method").getOrCreate()
# Define the data as lists
names = ["Alice", "Bob", "Charlie", "David"]
ages = [25, 30, 35, 40]
salaries = [50000.0, 60000.0, 70000.0, 80000.0]
# Create data using list comprehension
data = [(names[i], ages[i], salaries[i]) for i in range(len(names))]
# Define the schema
schema = StructType([
StructField("Name", StringType(), True),
StructField("Age", IntegerType(), True),
StructField("Salary", DoubleType(), True)
])
# Create PySpark DataFrame
df = spark.createDataFrame(data, schema)
# Show the DataFrame
df.show()
Output
+-------+---+-------+
|   Name|Age| Salary|
+-------+---+-------+
|  Alice| 25|50000.0|
|    Bob| 30|60000.0|
|Charlie| 35|70000.0|
|  David| 40|80000.0|
+-------+---+-------+
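The real advantage of the list-comprehension approach is that values can be transformed while the rows are being built. A minimal pure-Python sketch that derives a hypothetical "Bonus" column (10% of salary, an illustrative rule, not from the article) on the fly:

```python
names = ["Alice", "Bob"]
salaries = [50000.0, 60000.0]

# Compute a derived column while pairing the lists
data = [(n, s, round(s * 0.10, 2)) for n, s in zip(names, salaries)]
print(data)  # [('Alice', 50000.0, 5000.0), ('Bob', 60000.0, 6000.0)]
```

The resulting list of 3-tuples can then be passed to spark.createDataFrame together with a schema that includes the extra DoubleType field.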
Comparison of Methods
| Method | Best For | Performance | Flexibility |
|---|---|---|---|
| zip() | Simple data combination | Fast | Good |
| NumPy Array | Numerical data | Fast | Limited |
| List Comprehension | Complex transformations | Medium | High |
Conclusion
Creating PySpark DataFrames from multiple lists can be accomplished using several methods. Use zip() for simple combinations, NumPy arrays for numerical data, and list comprehension when you need data transformations. Always define a proper schema for better performance and type safety.
