How to use Matplotlib to plot PySpark SQL results?


To use Matplotlib to plot PySpark SQL results, we can take the following steps −

  • Set the figure size and adjust the padding between and around the subplots.
  • Get the SparkContext instance, the main entry point for Spark functionality.
  • Get a HiveContext instance, a variant of Spark SQL that integrates with data stored in Hive.
  • Make a list of records as tuples.
  • Distribute the local Python collection to form an RDD.
  • Map each record to a Row so the data carries a schema (id, name).
  • Create a DataFrame from the RDD of Rows.
  • Register the DataFrame as the table "my_table".
  • Run an SQL query against the table to retrieve the records.
  • Convert the fetched records into a Pandas data frame.
  • Set the "name" column as the index and plot the data.
  • To display the figure, use the show() method.

Example

from pyspark.sql import Row
from pyspark.sql import HiveContext
import pyspark
import matplotlib.pyplot as plt

# Set the figure size and adjust the padding between and around the subplots
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

# Main entry point for Spark functionality
sc = pyspark.SparkContext()
# Variant of Spark SQL that integrates with data stored in Hive
sqlContext = HiveContext(sc)

# List of records as tuples
test_list = [(1, 'John'), (2, 'James'), (3, 'Jack'), (4, 'Joe')]

# Distribute the local Python collection to form an RDD
rdd = sc.parallelize(test_list)

# Map each record to a Row so the data carries a schema (id, name)
people = rdd.map(lambda x: Row(id=int(x[0]), name=x[1]))

# Create a DataFrame from the RDD of Rows and register it as "my_table"
schemaPeople = sqlContext.createDataFrame(people)
sqlContext.registerDataFrameAsTable(schemaPeople, "my_table")

# Run an SQL query and convert the result into a Pandas data frame
df = sqlContext.sql("Select * from my_table")
df = df.toPandas()

# Set the "name" column as the index and plot the data
df.set_index('name').plot()

# Display the figure
plt.show()

Output

The figure shows a line plot of the id values (1 to 4) with the names John, James, Jack and Joe along the x-axis.
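Note that HiveContext and registerDataFrameAsTable belong to the older Spark 1.x API. On Spark 2.x and later, the same plot can be produced through SparkSession. The following is a minimal sketch assuming a Spark 2.x+ environment; the application name "plot_sql_results" is an arbitrary choice.

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

# SparkSession is the unified entry point in Spark 2.x+
spark = SparkSession.builder.appName("plot_sql_results").getOrCreate()

# Build the DataFrame directly from the local records
people = spark.createDataFrame(
   [(1, 'John'), (2, 'James'), (3, 'Jack'), (4, 'Joe')], ["id", "name"])

# Register a temporary view and query it with SQL
people.createOrReplaceTempView("my_table")
pdf = spark.sql("SELECT * FROM my_table").toPandas()

# Set the "name" column as the index and plot the data
pdf.set_index('name').plot()
plt.show()

Since the names are categorical, replacing plot() with plot.bar() may give a more readable chart.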
