How to Convert a List of Dictionaries into a PySpark DataFrame?

PySpark allows you to convert Python data structures into distributed DataFrames for big data processing. Converting a list of dictionaries into a PySpark DataFrame is a common task when working with structured data in distributed computing environments.

In this tutorial, we will explore the step-by-step process of converting a list of dictionaries into a PySpark DataFrame using PySpark's DataFrame API.

Prerequisites

Before starting, ensure you have PySpark installed and a SparkSession created:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create SparkSession
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()

Sample Data

Let's create a sample list of dictionaries representing employee information:

# Sample list of dictionaries
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]

print("Original data:")
for emp in employee_data:
    print(emp)
Original data:
{'name': 'Prince', 'age': 30, 'department': 'Engineering'}
{'name': 'Mukul', 'age': 35, 'department': 'Sales'}
{'name': 'Durgesh', 'age': 28, 'department': 'Marketing'}
{'name': 'Doku', 'age': 32, 'department': 'Finance'}

Method 1: Using createDataFrame() Directly

The simplest approach is to pass the list of dictionaries directly to the createDataFrame() method:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()

# Sample data
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]

# Create DataFrame directly
df = spark.createDataFrame(employee_data)
df.show()
+-------+---+------------+
|   name|age|  department|
+-------+---+------------+
| Prince| 30| Engineering|
|  Mukul| 35|       Sales|
|Durgesh| 28|   Marketing|
|   Doku| 32|     Finance|
+-------+---+------------+

Method 2: Using RDD with Schema

For more control over data types, create an RDD first and then apply a custom schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create SparkSession
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()

# Sample data
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]

# Step 1: Create RDD from list of dictionaries
rdd = spark.sparkContext.parallelize(employee_data)

# Step 2: Define schema
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=False),
    StructField("department", StringType(), nullable=False)
])

# Step 3: Create DataFrame with schema
df = spark.createDataFrame(rdd, schema)
df.show()

# Display schema information
df.printSchema()
+-------+---+------------+
|   name|age|  department|
+-------+---+------------+
| Prince| 30| Engineering|
|  Mukul| 35|       Sales|
|Durgesh| 28|   Marketing|
|   Doku| 32|     Finance|
+-------+---+------------+

root
 |-- name: string (nullable = false)
 |-- age: integer (nullable = false)
 |-- department: string (nullable = false)

DataFrame Operations

Once created, you can perform various operations on the DataFrame:

from pyspark.sql import SparkSession

# Create SparkSession and DataFrame
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]
df = spark.createDataFrame(employee_data)

# Filter employees older than 30
print("Employees older than 30:")
df.filter(df.age > 30).show()

# Select specific columns
print("Names and departments:")
df.select("name", "department").show()

# Count records
print(f"Total employees: {df.count()}")
Employees older than 30:
+-----+---+----------+
| name|age|department|
+-----+---+----------+
|Mukul| 35|     Sales|
| Doku| 32|   Finance|
+-----+---+----------+

Names and departments:
+-------+------------+
|   name|  department|
+-------+------------+
| Prince| Engineering|
|  Mukul|       Sales|
|Durgesh|   Marketing|
|   Doku|     Finance|
+-------+------------+

Total employees: 4

Comparison

Method                     Schema Control   Best For
Direct createDataFrame()   Automatic        Quick prototyping
RDD + Schema               Explicit         Production environments

Conclusion

Converting a list of dictionaries to a PySpark DataFrame can be done directly using createDataFrame() or with explicit schema definition for better control. The explicit schema approach is recommended for production environments to ensure data type consistency.

Updated on: 2026-03-27T09:26:21+05:30
