Programmatically Specifying the Schema



The second method for creating a DataFrame is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. A DataFrame can be created programmatically in the following three steps.

  • Create an RDD of Rows from an Original RDD.

  • Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.

  • Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.
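The three steps above can be sketched end to end as follows. This is a minimal sketch intended for the Spark shell, where `sc` is the pre-defined SparkContext; the file path and column names are illustrative:

```scala
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Step 1: create an RDD of Rows from the original RDD of text lines
val lines  = sc.textFile("employee.txt")
val rowRDD = lines.map(_.split(",")).map(e => Row(e(0).trim, e(1).trim, e(2).trim))

// Step 2: build a StructType matching the structure of the Rows
val schema = StructType(Seq(
  StructField("id",   StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age",  StringType, nullable = true)
))

// Step 3: apply the schema to the RDD of Rows
val sqlContext = new SQLContext(sc)
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
```

The sections below walk through each of these steps in the Spark shell.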

Example

Let us consider an example of employee records in a text file named employee.txt. We will read the data from the text file, construct a schema programmatically, and apply it to create a DataFrame.

Given Data − Look at the following data in a file named employee.txt, placed in the current directory from which the Spark shell is running.

1201, satish, 25
1202, krishna, 28
1203, amith, 39
1204, javed, 23
1205, prudvi, 23

Follow the steps given below to generate a schema programmatically.

Open Spark Shell

Start the Spark shell using the following command.

$ spark-shell

Create SQLContext Object

Generate an SQLContext using the following command. Here, sc is the SparkContext object.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Read Input from Text File

Create an RDD by reading the data from the text file named employee.txt using the following command.

scala> val employee = sc.textFile("employee.txt")

Create an Encoded Schema in a String Format

Use the following command to create an encoded schema in string format. That is, assume the field structure of the table and pass the field names separated by a delimiter (here, a space).

scala> val schemaString = "id name age"

Output

schemaString: String = id name age

Import Respective APIs

Use the following command to import Row capabilities and SQL DataTypes.

scala> import org.apache.spark.sql.Row;
scala> import org.apache.spark.sql.types.{StructType, StructField, StringType};

Generate Schema

The following command generates a schema from the schemaString variable. It splits the whole string using space as the delimiter and, for each resulting field name, creates a StructField of type String by default.

scala> val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
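If you prefer typed columns instead of treating every field as a string, a hypothetical variant of the schema could declare IntegerType for the numeric fields (this is an illustrative sketch, not part of the steps below):

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Illustrative typed variant: id and age as integers, name as a string.
val typedSchema = StructType(Seq(
  StructField("id",   IntegerType, nullable = true),
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)
))
```

With a typed schema like this, the Row objects built in the next step would have to contain matching types, e.g. `Row(e(0).trim.toInt, e(1).trim, e(2).trim.toInt)`. For the rest of this example, we keep the all-string schema defined above.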

Apply Transformation for Reading Data from Text File

Use the following command to convert the RDD (employee) into an RDD of Rows, stored in rowRDD. Two map functions are chained: the first splits each record string on the comma delimiter (.map(_.split(","))), and the second builds a Row from the trimmed fields (.map(e ⇒ Row(e(0).trim, e(1).trim, e(2).trim))). All fields are kept as strings to match the schema defined above.

scala> val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim, e(1).trim, e(2).trim))

Apply the Schema to the RDD of Rows

Use the following statement to create a DataFrame from the rowRDD data and the schema variable.

scala> val employeeDF = sqlContext.createDataFrame(rowRDD, schema)

Output

employeeDF: org.apache.spark.sql.DataFrame = [id: string, name: string, age: string]

Register DataFrame as a Table

Use the following command to register the DataFrame as a temporary table named employee.

scala> employeeDF.registerTempTable("employee")

The employee table is now ready. Let us pass some SQL queries to it using the SQLContext.sql() method.

Select Query on DataFrame

Use the following statement to select all records from the employee table. The variable allrecords captures the result; to display the records, call the show() method on it.

scala> val allrecords = sqlContext.sql("SELECT * FROM employee")

To see the result data of allrecords DataFrame, use the following command.

scala> allrecords.show()

Output

+------+--------+----+
|  id  | name   |age |
+------+--------+----+
| 1201 | satish | 25 |
| 1202 | krishna| 28 |
| 1203 | amith  | 39 |
| 1204 | javed  | 23 |
| 1205 | prudvi | 23 |
+------+--------+----+

Specifying the schema programmatically in this way allows you to construct DataFrames when the columns and their types are not known until runtime. You can now run different SQL queries against the registered table.
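For instance, a filtered query against the same temporary table might look like the following sketch (the query and variable name are illustrative):

```scala
// Select a single employee by name from the registered temporary table
scala> val satish = sqlContext.sql("SELECT * FROM employee WHERE name = 'satish'")

// Display the matching record
scala> satish.show()
```

Any SQL supported by the SQLContext query engine can be used in the same way.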
