How to Convert Pandas to PySpark DataFrame?


Pandas and PySpark are two popular data processing tools in Python. While Pandas is well-suited for working with small to medium-sized datasets on a single machine, PySpark is designed for distributed processing of large datasets across multiple machines.

Converting a pandas DataFrame to a PySpark DataFrame can be necessary when you need to scale up your data processing to handle larger datasets. In this guide, we'll explore the process of converting a pandas DataFrame to a PySpark DataFrame using the PySpark library in Python.

We'll cover the steps involved in installing and setting up PySpark, converting a pandas DataFrame to a PySpark DataFrame, and some common operations you can perform on PySpark DataFrames.

The syntax for creating a PySpark DataFrame using the createDataFrame() method is as follows:

spark.createDataFrame(data, schema)

Here, data is the data from which the DataFrame is built (for example, a list of rows, an RDD, or a pandas DataFrame), and schema is either a StructType describing the structure of the dataset or a list of column names. The spark object is the SparkSession in PySpark.
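For instance, a minimal sketch of this syntax, assuming an existing SparkSession object named spark, might look like this:

# Assumes an existing SparkSession object named `spark`
data = [('John', 30), ('Jane', 25)]        # list of rows
schema = ['Name', 'Age']                   # list of column names
df = spark.createDataFrame(data, schema)   # build the PySpark DataFrame
df.show()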

Using the spark.createDataFrame() Method

Here's an example code that demonstrates how to create a pandas DataFrame and then convert it to a PySpark DataFrame using the spark.createDataFrame() method.

Consider the code shown below. In this code, we first create a sample pandas DataFrame called df_pandas. We then create a SparkSession object using SparkSession.builder, which allows us to work with PySpark.

Next, we use the createDataFrame() method provided by the spark object to convert our pandas DataFrame to a PySpark DataFrame. The createDataFrame() method takes the pandas DataFrame as its input and returns a new PySpark DataFrame object.

Finally, we use the show() method to display the contents of the PySpark DataFrame to the console.

import pandas as pd
from pyspark.sql import SparkSession

# Create a sample pandas DataFrame
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [30, 25, 40],
        'Salary': [50000.0, 60000.0, 70000.0]}
df_pandas = pd.DataFrame(data)

# Create a SparkSession object
spark = SparkSession.builder.appName('PandasToSparkDF').getOrCreate()

# Convert pandas DataFrame to PySpark DataFrame
df_spark = spark.createDataFrame(df_pandas)

# Show the PySpark DataFrame
df_spark.show()

Before running the above code, make sure that you have the Pandas and PySpark libraries installed on your system.
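If they are not installed yet, both can typically be installed with pip:

pip3 install pandas pyspark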

Output

On execution, it will produce the following output:

+----+---+-------+
|Name|Age| Salary|
+----+---+-------+
|John| 30|50000.0|
|Jane| 25|60000.0|
| Bob| 40|70000.0|
+----+---+-------+
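Once the data is in a PySpark DataFrame, you can apply the usual PySpark DataFrame operations to it. As a brief sketch, continuing from the df_spark object created above:

# A few common operations on the PySpark DataFrame created above
df_spark.select('Name', 'Salary').show()        # select specific columns
df_spark.filter(df_spark['Age'] > 28).show()    # filter rows by a condition
df_spark.groupBy().avg('Salary').show()         # aggregate: average salary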

Using Apache Arrow and Parquet

Here's an updated version of the code that demonstrates how to use Apache Arrow (via the PyArrow library) and the Parquet format to move data from a pandas DataFrame into a PySpark DataFrame.

Consider the code shown below. In this code, we first create a sample pandas DataFrame called df_pandas. We then use the PyArrow library to convert the pandas DataFrame to a PyArrow Table using the Table.from_pandas() method.

Next, we write the PyArrow Table to disk in Parquet format using the pq.write_table() method. This creates a file called data.parquet in the current directory.

Finally, we use the spark.read.parquet() method to read the Parquet file into a PySpark DataFrame called df_spark. We can then use the show() method to display the contents of the PySpark DataFrame to the console.

Using Apache Arrow and Parquet format to convert data between Pandas and PySpark can improve performance by reducing data serialization overhead and enabling efficient columnar storage.

import pandas as pd
from pyspark.sql import SparkSession
import pyarrow as pa
import pyarrow.parquet as pq

# Create a sample pandas DataFrame
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [30, 25, 40],
        'Salary': [50000.0, 60000.0, 70000.0]}
df_pandas = pd.DataFrame(data)

# Convert pandas DataFrame to PyArrow Table
table = pa.Table.from_pandas(df_pandas)

# Write the PyArrow Table to Parquet format
pq.write_table(table, 'data.parquet')

# Create a SparkSession object
spark = SparkSession.builder.appName('PandasToSparkDF').getOrCreate()

# Read the Parquet file into a PySpark DataFrame
df_spark = spark.read.parquet('data.parquet')

# Show the PySpark DataFrame
df_spark.show()

To run the above code, we first need to install the pyarrow library on our machine, which can be done with the command shown below.

pip3 install pyarrow

Output

On execution, it will produce the following output:

+----+---+-------+
|Name|Age| Salary|
+----+---+-------+
|John| 30|50000.0|
|Jane| 25|60000.0|
| Bob| 40|70000.0|
+----+---+-------+
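Note that recent versions of PySpark can also use Arrow directly during spark.createDataFrame() conversion, without an intermediate Parquet file, by enabling an Arrow-related configuration flag. A minimal sketch, assuming Spark 3.x:

# Enable Arrow-based conversion for spark.createDataFrame (Spark 3.x)
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
df_spark_arrow = spark.createDataFrame(df_pandas)
df_spark_arrow.show()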

Conclusion

In conclusion, converting a pandas DataFrame to a PySpark DataFrame can be done either directly with the spark.createDataFrame() method, or by using PyArrow to convert the pandas DataFrame to a PyArrow Table, writing it to disk in Parquet format, and reading the resulting Parquet file into a PySpark DataFrame.

PySpark provides a powerful distributed computing framework that can handle large-scale data processing, making it an ideal choice for big data analysis. By using the above methods to convert Pandas DataFrames to PySpark DataFrames, users can take advantage of both the powerful features of PySpark and the convenience of working with Pandas DataFrames.
