Apache Drill - Fundamentals



In this chapter, we will discuss about the nested data model, JSON, Apache Avro, nested query language along with some other components in detail.

Drill Nested Data Model

Apache Drill supports various data models. The initial goal is to support the column-based format used by Dremel, then it is designed to support schema less models such as JSON, BSON (Binary JSON) and schema based models like Avro and CSV.

JSON

JSON (JavaScript Object Notation) is a lightweight text-based open standard designed for human-readable data interchange. JSON format is used for serializing and transmitting structured data over network connection. It is primarily used to transmit data between a server and web applications. JSON is typically perceived as a format whose main advantage is that it is simple and lean. It can be used without knowing or caring about any underlying schema.

Following is a basic JSON schema, which covers a classical product catalog description −

{
   "$schema": "http://json-schema.org/draft-04/schema#",
   "title": "Product",
   "description": “Classical product catalog",
   "type": "object",
   "properties": {
      "id": {
         "description": "The unique identifier for a product",
         "type": "integer"
      },
      "name": {
         "description": "Name of the product",
         "type": "string"
      },
      "price": {
         "type": "number",
         "minimum": 0,
         "exclusiveMinimum": true
      }
   },
   "required": ["id", "name", "price"]
}

The JSON Schema has the capability to express basic definitions and constraints for data types contained in objects, and it also supports some more advanced features such as properties typed as other objects, inheritance, and links.

Apache Avro

Avro is an Apache open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Avro is a schema-based system. A language-independent schema is associated with its read and write operations. Using Avro, big data can be exchanged between programs written in any language. Avro supports a rich set of primitive data types including numeric, binary data and strings, and a number of complex types including arrays, maps, enumerations and records. A key feature of Avro is the robust support for data schemas that change over time.

Simple Avro Schema

Avro schema is created in JavaScript Object Notation (JSON) document format, which is a lightweight text-based data interchange format.

For example

The given schema defines a (record type) document within "AvroSample" namespace. The name of document is "Employee" which contains two "Fields" → Name and Age.

{
   " type " : "record",
   " namespace " : "AvroSample",
   " name " : "Employee",
   " fields " : [
      { "name" : " Name" , "type" : "string" },
      { "name" : "age" , "type" : "int" }
   ]
}

The above schema contains four attributes, they have been briefly described here −

  • type − Describes document type, in this case a “record"

  • namespace − Describes the name of the namespace in which the object resides

  • name − Describes the schema name

  • fields − This is an attribute array which contains the following

  • name − Describes the name of field

  • type − Describes data type of field

Nested Query Language

Apache Drill supports various query languages. The initial goal is to support the SQL-like language used by Dremel and Google BigQuery. DrQL and Mongo query languages are an examples of Drill nested query languages.

DrQL

The DrQL (Drill Query Language) is a nested query language. DrQL is SQL like query language for nested data. It is designed to support efficient column-based processing.

Mongo Query Language

The MongoDB is an open-source document database, and leading NoSQL database. MongoDB is written in C++ and it is a cross-platform, document-oriented database that provides, high performance, high availability, and easy scalability. MongoDB works on the concept of collection and documenting.

Wherein, collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database. A document is a set of key-value pairs.

Drill File Format

Drill supports various file formats such as CSV, TSV, PSV, JSON and Parquet. Wherein, “Parquet” is the special file format which helps Drill to run faster and its data representation is almost identical to Drill data representation.

Parquet

Parquet is a columnar storage format in the Hadoop ecosystem. Compared to a traditional row-oriented format, it is much more efficient in storage and has better query performance. Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression.

It has the following important characteristics −

  • Self-describing data format
  • Columnar format
  • Flexible compression options
  • Large file size

Flat Files Format

The Apache Drill allows access to structured file types and plain text files (flat files). It consists of the following types −

  • CSV files (comma-separated values)
  • TSV files (tab-separated values)
  • PSV files (pipe-separated values)

CSV file format − A CSV is a comma separated values file, which allows data to be saved in a table structured format. CSV data fields are often separated or delimited by comma (,). The following example refers to a CSV format.

firstname, age
Alice,21
Peter,34

This CSV format can be defined as follows in a drill configuration.

"formats": {
   "csv": {
      "type": "text",
      "extensions": [
         “csv2"
      ],
      "delimiter": “,”
   }
}

TSV file format − The TSV data fields are often separated or delimited by a tab and saved with an extension of “.tsv" format. The following example refers to a TSV format.

firstname age
Alice 21
Peter 34

The TSV format can be defined as follows in a drill configuration.

"tsv": {
   "type": "text",
   "extensions": [
      "tsv"
   ],
   "delimiter": “\t"
},

PSV file format − The PSV data fields are separated or delimited by a pipe (|) symbol. The following example refers to a PSV format.

firstname|age
Alice|21
Peter|34

The PSV format can be defined as follows in a drill configuration.

"formats": {
   "psv": {
      "type": "text",
      "extensions": [
         "tbl"
      ],
      "delimiter": "|"
   }
}

These PSV files are saved with an extension of “.tbl” format.

Scalable Data Sources

Managing millions of data from multiple data sources requires a great deal of planning. When creating your data model, you need to consider the key goals such as the impact on speed of processing, how you can optimize memory usage and performance, scalability when handling growing volumes of data and requests.

Apache Drill provides the flexibility to immediately query complex data in native formats, such as schema-less data, nested data, and data with rapidly evolving schemas.

Following are its key benefits

  • High-performance analysis of data in its native format including self-describing data such as Parquet, JSON files and HBase tables.

  • Direct querying of data in HBase tables without defining and maintaining a schema in the Hive metastore.

  • SQL to query and work with semi-structured/nested data, such as data from NoSQL stores like MongoDB and online REST APIs.

Drill Clients

Apache Drill can connect to the following clients −

  • Multiple interfaces such as JDBC, ODBC, C++ API, REST using JSON
  • Drill shell
  • Drill web console (http://localhost:8047)
  • BI tools such as Tableau, MicroStrategy, etc.
  • Excel
Advertisements