How to use Matplotlib to plot PySpark SQL results?

To plot PySpark SQL results with Matplotlib, convert the Spark DataFrame returned by the query to a Pandas DataFrame, then pass its columns to Matplotlib. The process involves creating a Spark session, building a DataFrame, registering it as a temporary view, running a SQL query, and converting the result with toPandas().

Setting Up the Environment

First, we need to import the required libraries and configure Matplotlib:

from pyspark.sql import SparkSession, Row
import matplotlib.pyplot as plt

# Configure matplotlib
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

# Create Spark session
spark = SparkSession.builder.appName("MatplotlibExample").getOrCreate()

Creating Sample Data and DataFrame

We'll create a sample dataset and convert it to a Spark DataFrame:

# Create sample data
test_list = [(1, 'John', 85), (2, 'James', 92), (3, 'Jack', 78), (4, 'Joe', 88)]
df = spark.createDataFrame(test_list, ['id', 'name', 'score'])

# Register as temporary view for SQL queries
df.createOrReplaceTempView("students_table")

# Show the DataFrame
df.show()

+---+-----+-----+
| id| name|score|
+---+-----+-----+
|  1| John|   85|
|  2|James|   92|
|  3| Jack|   78|
|  4|  Joe|   88|
+---+-----+-----+

Running SQL Queries and Plotting

Now we can run SQL queries and plot the results using Matplotlib:

# Run SQL query
result_df = spark.sql("SELECT name, score FROM students_table ORDER BY score DESC")

# Convert to Pandas DataFrame for plotting
pandas_df = result_df.toPandas()

# Create the plot
plt.figure(figsize=(8, 5))
plt.bar(pandas_df['name'], pandas_df['score'], color=['blue', 'green', 'orange', 'red'])
plt.xlabel('Student Name')
plt.ylabel('Score')
plt.title('Student Scores from PySpark SQL Query')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
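
Spark drivers often run on machines without a display, where plt.show() has nothing to open. The sketch below writes the chart to a file instead. It assumes a headless environment, uses a small hand-built Pandas DataFrame as a stand-in for result_df.toPandas(), and saves to a hypothetical file name in the system temp directory:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, selected before any figure is drawn
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for result_df.toPandas(): same columns and rows as the query above
pandas_df = pd.DataFrame({
    "name": ["James", "Joe", "John", "Jack"],
    "score": [92, 88, 85, 78],
})

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(pandas_df["name"], pandas_df["score"])
ax.set_xlabel("Student Name")
ax.set_ylabel("Score")
ax.set_title("Student Scores from PySpark SQL Query")
fig.tight_layout()

# Hypothetical output location; change it to any writable path
output_path = os.path.join(tempfile.gettempdir(), "student_scores.png")
fig.savefig(output_path, dpi=150)
plt.close(fig)  # release the figure's memory in long-running jobs
```

With a real Spark session, replacing the stand-in with pandas_df = result_df.toPandas() should be the only change needed.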

Multiple Visualization Examples

Here's how to create different types of plots from PySpark SQL results:

# Create more complex dataset
data = [(1, 'Math', 85, 'A'), (2, 'Science', 92, 'A+'), 
        (3, 'History', 78, 'B'), (4, 'English', 88, 'A')]

subjects_df = spark.createDataFrame(data, ['id', 'subject', 'average_score', 'grade'])
subjects_df.createOrReplaceTempView("subjects")

# Query for plotting
query_result = spark.sql("""
    SELECT subject, average_score 
    FROM subjects 
    WHERE average_score > 80
""")

# Convert to pandas and create multiple plots
plot_data = query_result.toPandas()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
ax1.bar(plot_data['subject'], plot_data['average_score'])
ax1.set_title('High Scoring Subjects')
ax1.set_ylabel('Average Score')

# Line plot
ax2.plot(plot_data['subject'], plot_data['average_score'], marker='o')
ax2.set_title('Score Trend')
ax2.set_ylabel('Average Score')

plt.tight_layout()
plt.show()

# Close Spark session
spark.stop()
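
The subjects query above has no ORDER BY clause, so Spark does not guarantee the order of the returned rows, and the line plot may connect the points in a different order from run to run. One option is to add ORDER BY to the SQL. Another, sketched below, is to sort the converted Pandas DataFrame before plotting. The DataFrame here is a hand-built stand-in for query_result.toPandas(), and the Agg backend is an assumption so that the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for query_result.toPandas(): the rows with average_score > 80
plot_data = pd.DataFrame({
    "subject": ["Science", "Math", "English"],
    "average_score": [92, 85, 88],
})

# Sort explicitly so the plotted order does not depend on Spark's row order
ordered = plot_data.sort_values("average_score", ascending=False).reset_index(drop=True)

subjects = ordered["subject"].tolist()
scores = ordered["average_score"].tolist()

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(subjects, scores, marker="o")
ax.set_title("Scores, Highest First")
ax.set_ylabel("Average Score")
fig.tight_layout()
plt.close(fig)
```

Sorting in SQL is usually preferable for large results because the work stays on the cluster; sorting in Pandas is convenient when the converted result is already small.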

Key Points

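SparkSession − Unified entry point for DataFrame and SQL functionality since Spark 2.0 (it wraps the older SparkContext)
toPandas() − Collects the Spark DataFrame to the driver as a Pandas DataFrame for plotting, so the result must fit in driver memory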
createOrReplaceTempView() − Creates temporary SQL view for querying
spark.sql() − Executes SQL queries on registered views
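
Because toPandas() brings every row back to the driver, it can help to cap the result size before converting. The helper below is a minimal sketch: to_pandas_capped and its max_rows default are hypothetical names and values chosen for illustration, not part of PySpark. It relies only on the DataFrame methods limit() and toPandas():

```python
def to_pandas_capped(spark_df, max_rows=10_000):
    """Convert at most max_rows rows of a Spark DataFrame to Pandas.

    Hypothetical helper; the max_rows default is an arbitrary illustrative value.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is a lazy transformation; the collect happens in toPandas()
    return spark_df.limit(max_rows).toPandas()

# Usage with the query result from the article (requires a running Spark session):
# pandas_df = to_pandas_capped(result_df, max_rows=1000)
```

A cap like this guards against accidentally collecting a huge table. For charts that summarize data, aggregating in SQL (for example with GROUP BY) before converting is usually the better first step.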

Conclusion

Matplotlib works on in-memory data, so the bridge from PySpark is toPandas(): run the query with spark.sql(), convert the result, and plot the resulting columns. Because toPandas() collects every row to the driver, filter or aggregate in SQL first so that only the data needed for the plot is converted.

Rishikesh Kumar Rishi

Updated on: 2026-03-25T23:41:21+05:30
