How to Convert Pandas to PySpark DataFrame?
Pandas and PySpark are two popular data processing tools in Python. While Pandas is well-suited for working with small to medium-sized datasets on a single machine, PySpark is designed for distributed processing of large datasets across multiple machines.
Converting a pandas DataFrame to a PySpark DataFrame becomes necessary when you need to scale up your data processing to handle larger datasets. This guide explores two main approaches for converting pandas DataFrames to PySpark DataFrames.
Syntax
The basic syntax for creating a PySpark DataFrame is:
spark.createDataFrame(data, schema)
Here, data is the pandas DataFrame (or a list of values), and schema is an optional argument specifying either the structure of the dataset or a list of column names. The spark object is the active SparkSession.
Method 1: Using spark.createDataFrame()
The most straightforward approach is using the createDataFrame() method. First, create a SparkSession, then convert the pandas DataFrame directly:
import pandas as pd
from pyspark.sql import SparkSession
# Create a sample pandas DataFrame
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [30, 25, 40],
        'Salary': [50000.0, 60000.0, 70000.0]}
df_pandas = pd.DataFrame(data)
# Create a SparkSession object
spark = SparkSession.builder.appName('PandasToSparkDF').getOrCreate()
# Convert pandas DataFrame to PySpark DataFrame
df_spark = spark.createDataFrame(df_pandas)
# Show the PySpark DataFrame
df_spark.show()
+----+---+-------+
|Name|Age| Salary|
+----+---+-------+
|John| 30|50000.0|
|Jane| 25|60000.0|
| Bob| 40|70000.0|
+----+---+-------+
Method 2: Using PyArrow for Better Performance
For larger datasets, using Apache Arrow improves performance by reducing serialization overhead and enabling efficient columnar storage:
import pandas as pd
from pyspark.sql import SparkSession
import pyarrow as pa
import pyarrow.parquet as pq
# Create a sample pandas DataFrame
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [30, 25, 40],
        'Salary': [50000.0, 60000.0, 70000.0]}
df_pandas = pd.DataFrame(data)
# Convert pandas DataFrame to PyArrow Table
table = pa.Table.from_pandas(df_pandas)
# Write the PyArrow Table to Parquet format
pq.write_table(table, 'data.parquet')
# Create a SparkSession object
spark = SparkSession.builder.appName('PandasToSparkDF').getOrCreate()
# Read the Parquet file into a PySpark DataFrame
df_spark = spark.read.parquet('data.parquet')
# Show the PySpark DataFrame
df_spark.show()
Note: Install PyArrow using pip install pyarrow before running this code.
+----+---+-------+
|Name|Age| Salary|
+----+---+-------+
|John| 30|50000.0|
|Jane| 25|60000.0|
| Bob| 40|70000.0|
+----+---+-------+
Comparison
| Method | Performance | Memory Usage | Best For |
|---|---|---|---|
| createDataFrame() | Good for small data | Higher | Quick conversions, small datasets |
| PyArrow + Parquet | Better for large data | Lower | Large datasets, repeated operations |
Key Points
- Use createDataFrame() for simple, one-time conversions of small datasets
- Use the PyArrow approach for better performance with large datasets
- Always create a SparkSession before attempting conversions
- PyArrow requires separate installation but provides significant performance benefits
Conclusion
Converting pandas DataFrames to PySpark DataFrames enables scaling from single-machine to distributed processing. Use createDataFrame() for simplicity or PyArrow for better performance with larger datasets.
