How to use Matplotlib to plot PySpark SQL results?
To plot PySpark SQL results with Matplotlib, we convert the Spark DataFrame returned by the query to a Pandas DataFrame and then pass its columns to Matplotlib. The process involves creating a Spark session, building a DataFrame, running a SQL query, and converting the result for plotting.
Setting Up the Environment
First, we need to import the required libraries and configure Matplotlib:
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
# Configure matplotlib
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True
# Create Spark session
spark = SparkSession.builder.appName("MatplotlibExample").getOrCreate()
Creating Sample Data and DataFrame
We'll create a sample dataset and convert it to a Spark DataFrame:
# Create sample data
test_list = [(1, 'John', 85), (2, 'James', 92), (3, 'Jack', 78), (4, 'Joe', 88)]
df = spark.createDataFrame(test_list, ['id', 'name', 'score'])
# Register as temporary view for SQL queries
df.createOrReplaceTempView("students_table")
# Show the DataFrame
df.show()
+---+-----+-----+
| id| name|score|
+---+-----+-----+
|  1| John|   85|
|  2|James|   92|
|  3| Jack|   78|
|  4|  Joe|   88|
+---+-----+-----+
Running SQL Queries and Plotting
Now we can run SQL queries and plot the results using Matplotlib:
# Run SQL query
result_df = spark.sql("SELECT name, score FROM students_table ORDER BY score DESC")
# Convert to Pandas DataFrame for plotting
pandas_df = result_df.toPandas()
# Create the plot
plt.figure(figsize=(8, 5))
plt.bar(pandas_df['name'], pandas_df['score'], color=['blue', 'green', 'orange', 'red'])
plt.xlabel('Student Name')
plt.ylabel('Score')
plt.title('Student Scores from PySpark SQL Query')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
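On a cluster node or in a batch job there is often no display, so plt.show() has nothing to open. A minimal sketch of one way around that, assuming the non-interactive Agg backend is acceptable and the chart should be written to a file instead. The function name save_score_chart and the default file name are illustrative, not part of the original example:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: renders to files, needs no display
import matplotlib.pyplot as plt


def save_score_chart(names, scores, path="scores.png"):
    """Draw a bar chart of scores by name and save it to `path`.

    `names` and `scores` can be any sequences of equal length,
    including the columns of a Pandas DataFrame.
    """
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(list(names), list(scores))
    ax.set_xlabel("Student Name")
    ax.set_ylabel("Score")
    ax.set_title("Student Scores from PySpark SQL Query")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)  # release the figure's memory once it is on disk
    return path
```

With the converted result from the query above, the call would be save_score_chart(pandas_df['name'], pandas_df['score']).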
Multiple Visualization Examples
Here's how to create different types of plots from PySpark SQL results:
# Create more complex dataset
data = [(1, 'Math', 85, 'A'), (2, 'Science', 92, 'A+'),
        (3, 'History', 78, 'B'), (4, 'English', 88, 'A')]
subjects_df = spark.createDataFrame(data, ['id', 'subject', 'average_score', 'grade'])
subjects_df.createOrReplaceTempView("subjects")
# Query for plotting
query_result = spark.sql("""
SELECT subject, average_score
FROM subjects
WHERE average_score > 80
""")
# Convert to pandas and create multiple plots
plot_data = query_result.toPandas()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Bar plot
ax1.bar(plot_data['subject'], plot_data['average_score'])
ax1.set_title('High Scoring Subjects')
ax1.set_ylabel('Average Score')
# Line plot
ax2.plot(plot_data['subject'], plot_data['average_score'], marker='o')
ax2.set_title('Score Trend')
ax2.set_ylabel('Average Score')
plt.tight_layout()
plt.show()
# Close Spark session
spark.stop()
Key Points
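- SparkSession − Unified entry point for Spark functionality since Spark 2.0; it wraps the older SparkContext, which remains available as spark.sparkContext
- toPandas() − Converts a Spark DataFrame to a Pandas DataFrame for plotting; it collects every row to the driver, so use it on results small enough to fit in driver memory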
- createOrReplaceTempView() − Creates temporary SQL view for querying
- spark.sql() − Executes SQL queries on registered views
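Because toPandas() pulls the whole result into the driver's memory, it can help to cap the row count on the Spark side before converting. The following is a minimal sketch under that assumption; the helper name to_pandas_capped and the default of 10,000 rows are hypothetical choices, not values from the original example:

```python
def to_pandas_capped(spark_df, max_rows=10_000):
    """Convert a Spark DataFrame to Pandas, keeping at most `max_rows` rows.

    limit() is applied by Spark before collection, so toPandas() only
    brings the capped result back to the driver.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    return spark_df.limit(max_rows).toPandas()
```

For example, to_pandas_capped(result_df, max_rows=500) would return at most 500 rows of the query result. Without an ORDER BY in the query, which rows survive the cap is not guaranteed, so aggregating or sorting in SQL first usually gives a more meaningful plot.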
Conclusion
Matplotlib cannot read Spark DataFrames directly, so the bridge is toPandas(): run the query with spark.sql(), convert the result to a Pandas DataFrame, and pass its columns to any Matplotlib plotting function. Keep the converted result small by filtering or aggregating in SQL first.
