Spark SQL - JSON Datasets



Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String or a JSON file.
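As a quick sketch of the RDD-of-strings path, a JSON record held in memory can be parallelized and read the same way (the record below is illustrative, not part of employee.json):

scala> val jsonRDD = sc.parallelize("""{"id" : "1206", "name" : "ravi", "age" : "30"}""" :: Nil)
scala> val rddDF = sqlContext.read.json(jsonRDD)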

Spark SQL provides an option for querying JSON data along with auto-capturing of JSON schemas for both reading and writing data. Spark SQL understands the nested fields in JSON data and allows users to directly access these fields without any explicit transformations.
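For example, suppose a hypothetical file named employee_nested.json holds records with a nested address object:

{"id" : "1201", "name" : "satish", "address" : {"city" : "hyderabad", "pin" : "500001"}}

The nested field can then be selected directly with dot notation, with no explicit flattening step:

scala> val nested = sqlContext.read.json("employee_nested.json")
scala> nested.select("address.city").show()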

Example

Let us consider an example of employee records in a file named employee.json. The following commands read this JSON document and generate a DataFrame (dfs) based on the schema inferred from it.

employee.json − Place this file in the directory from which the Spark shell (scala>) was started.

{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}

Let us perform some DataFrame operations on the given data.

DataFrame Operations

DataFrame provides a domain-specific language for structured data manipulation. Here we include some basic examples of structured data processing using DataFrames.

Follow the steps given below to perform DataFrame operations −

Read JSON Document

First of all, we have to read the JSON document. Use the following command to read employee.json, which contains the fields id, name, and age; it creates a DataFrame named dfs.

scala> val dfs = sqlContext.read.json("employee.json")

Output − The field names are automatically taken from employee.json.

dfs: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

Use printSchema Method

If you want to see the structure (schema) of the DataFrame, use the following command.

scala> dfs.printSchema()

Output

root
   |-- age: string (nullable = true)
   |-- id: string (nullable = true)
   |-- name: string (nullable = true)
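Note that every field, including age, was inferred as a string because the values are quoted in the JSON document. If numeric operations are needed, the column can be cast explicitly — a sketch:

scala> dfs.select(dfs("name"), dfs("age").cast("int").alias("age")).show()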

Show the data

If you want to display the data in the DataFrame, use the following command.

scala> dfs.show()

Output − You can see the employee data in a tabular format.

+----+------+--------+
|age |  id  |  name  |
+----+------+--------+
| 25 | 1201 | satish |
| 28 | 1202 | krishna|
| 39 | 1203 | amith  |
| 23 | 1204 | javed  |
| 23 | 1205 | prudvi |
+----+------+--------+

We can then run SQL statements against this DataFrame. Users can migrate data into JSON format with minimal effort, regardless of the origin of the data source.
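For instance, the DataFrame can be registered as a temporary table and queried with SQL, and the result written back out as JSON (the table name and output path below are illustrative):

scala> dfs.registerTempTable("employee")
scala> sqlContext.sql("SELECT id, name FROM employee WHERE age = '23'").show()
scala> dfs.write.json("employee_out")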
