Spark SQL - Parquet Files



Parquet is a columnar format, supported by many data processing systems. The advantages of columnar storage are as follows −

  • Columnar storage limits IO operations.

  • Columnar storage lets you fetch only the specific columns that you need to access.

  • Columnar storage consumes less space.

  • Columnar storage provides better summarization and supports type-specific encoding.

Spark SQL provides support for both reading and writing Parquet files, automatically preserving the schema of the original data. Parquet files follow the same loading procedure as JSON datasets.
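
For instance, the schema capture can be seen end to end with a small, self-contained sketch. The people.parquet path, the column names, and the sample values below are made up for illustration and are not part of the employee example that follows.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.implicits._
scala> val people = Seq(("ravi", 30), ("meena", 27)).toDF("name", "age")   // hypothetical sample data
scala> people.write.parquet("people.parquet")       // schema is stored in the Parquet metadata
scala> val back = sqlContext.read.parquet("people.parquet")
scala> back.printSchema()                           // schema is recovered without being specified
scala> back.select("name").show()                   // Parquet lets Spark read just the name column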

Let us look again at the same example of employee record data, this time as a Parquet file named employee.parquet placed in the same directory where spark-shell is running.

Given data − You do not need to convert the input data of employee records into Parquet format by hand. Place the employee.json document, which we used as the input file in our previous examples, in the current directory; the following commands convert that data into a Parquet file.

$ spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val employee = sqlContext.read.json("employee.json")
scala> employee.write.parquet("employee.parquet")
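
Note that if the employee.parquet directory already exists from an earlier run, the write above will fail by default because the output path already exists. Assuming Spark 1.4 or later, where the DataFrameWriter supports save modes, one way to rerun the conversion is to request an overwrite explicitly − a minimal sketch:

scala> employee.write.mode("overwrite").parquet("employee.parquet")   // replaces any existing output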

It is not possible to show the Parquet file directly, because it is written as a directory structure, which you can find in the current directory. If you want to see the directory and file structure, use the following commands.

$ cd employee.parquet/

$ ls
_common_metadata
part-r-00001.gz.parquet
_metadata
_SUCCESS

The following commands are used for reading the Parquet file, registering it as a table, and applying some queries on it.

Open Spark Shell

Start the Spark shell using the following command.

$ spark-shell

Create SQLContext Object

Generate an SQLContext using the following command. Here, sc is the SparkContext object.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Read Input from Parquet File

Create a DataFrame by reading data from the Parquet file named employee.parquet using the following statement.

scala> val parqfile = sqlContext.read.parquet("employee.parquet")

Store the DataFrame in a Table

Use the following command for storing the DataFrame data in a table named employee. After this command, we can apply all types of SQL statements to it.

scala> parqfile.registerTempTable("employee")

The employee table is ready. Let us now pass some SQL queries on the table using the method SQLContext.sql().

Select Query on DataFrame

Use the following command for selecting all records from the employee table. Here, we use the variable allrecords for capturing all the records data. To display those records, call the show() method on it.

scala> val allrecords = sqlContext.sql("SELECT * FROM employee")

To see the result data of allrecords DataFrame, use the following command.

scala> allrecords.show()

Output

+------+--------+----+
|  id  | name   |age |
+------+--------+----+
| 1201 | satish | 25 |
| 1202 | krishna| 28 |
| 1203 | amith  | 39 |
| 1204 | javed  | 23 |
| 1205 | prudvi | 23 |
+------+--------+----+
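
Any other SQL statement can be applied to the registered table in the same way. For example, a hypothetical filter on the age column could be written as the following sketch; with the data shown above it would return only the krishna and amith rows.

scala> val adults = sqlContext.sql("SELECT name, age FROM employee WHERE age > 25")
scala> adults.show()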