How to Convert a List of Dictionaries into a PySpark DataFrame?
PySpark allows you to convert Python data structures into distributed DataFrames for big data processing. Converting a list of dictionaries into a PySpark DataFrame is a common task when working with structured data in distributed computing environments.
In this tutorial, we will explore the step-by-step process of converting a list of dictionaries into a PySpark DataFrame using PySpark's DataFrame API.
Prerequisites
Before starting, ensure you have PySpark installed and a SparkSession created:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Create SparkSession
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()
Sample Data
Let's create a sample list of dictionaries representing employee information:
# Sample list of dictionaries
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]

print("Original data:")
for emp in employee_data:
    print(emp)
Original data:
{'name': 'Prince', 'age': 30, 'department': 'Engineering'}
{'name': 'Mukul', 'age': 35, 'department': 'Sales'}
{'name': 'Durgesh', 'age': 28, 'department': 'Marketing'}
{'name': 'Doku', 'age': 32, 'department': 'Finance'}
Method 1: Using createDataFrame() Directly
The simplest approach is to call the createDataFrame() method directly on the list of dictionaries:
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()
# Sample data
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]
# Create DataFrame directly
df = spark.createDataFrame(employee_data)
df.show()
+-------+---+------------+
|   name|age|  department|
+-------+---+------------+
| Prince| 30| Engineering|
|  Mukul| 35|       Sales|
|Durgesh| 28|   Marketing|
|   Doku| 32|     Finance|
+-------+---+------------+
Method 2: Using RDD with Schema
For more control over data types, create an RDD first and then apply a custom schema:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Create SparkSession
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()
# Sample data
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]
# Step 1: Create RDD from list of dictionaries
rdd = spark.sparkContext.parallelize(employee_data)
# Step 2: Define schema
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=False),
    StructField("department", StringType(), nullable=False)
])
# Step 3: Create DataFrame with schema
df = spark.createDataFrame(rdd, schema)
df.show()
# Display schema information
df.printSchema()
+-------+---+------------+
|   name|age|  department|
+-------+---+------------+
| Prince| 30| Engineering|
|  Mukul| 35|       Sales|
|Durgesh| 28|   Marketing|
|   Doku| 32|     Finance|
+-------+---+------------+

root
 |-- name: string (nullable = false)
 |-- age: integer (nullable = false)
 |-- department: string (nullable = false)
DataFrame Operations
Once created, you can perform various operations on the DataFrame:
from pyspark.sql import SparkSession
# Create SparkSession and DataFrame
spark = SparkSession.builder.appName("DictToDataFrame").getOrCreate()
employee_data = [
    {"name": "Prince", "age": 30, "department": "Engineering"},
    {"name": "Mukul", "age": 35, "department": "Sales"},
    {"name": "Durgesh", "age": 28, "department": "Marketing"},
    {"name": "Doku", "age": 32, "department": "Finance"}
]
df = spark.createDataFrame(employee_data)
# Filter employees older than 30
print("Employees older than 30:")
df.filter(df.age > 30).show()
# Select specific columns
print("Names and departments:")
df.select("name", "department").show()
# Count records
print(f"Total employees: {df.count()}")
Employees older than 30:
+-----+---+----------+
| name|age|department|
+-----+---+----------+
|Mukul| 35|     Sales|
| Doku| 32|   Finance|
+-----+---+----------+

Names and departments:
+-------+------------+
|   name|  department|
+-------+------------+
| Prince| Engineering|
|  Mukul|       Sales|
|Durgesh|   Marketing|
|   Doku|     Finance|
+-------+------------+

Total employees: 4
Comparison
| Method | Schema Control | Best For |
|---|---|---|
| Direct createDataFrame() | Automatic | Quick prototyping |
| RDD + Schema | Explicit | Production environments |
Conclusion
A list of dictionaries can be converted to a PySpark DataFrame either directly with createDataFrame() or by supplying an explicit schema for finer control over data types. The explicit-schema approach is recommended for production environments, where data type consistency matters.
