
Apache Drill - Fundamentals
This chapter covers the nested data model, JSON, Apache Avro, and nested query languages, along with related components such as file formats, data sources, and clients.
Drill Nested Data Model
Apache Drill supports various data models. Its initial goal was to support the column-based format used by Dremel. It was then designed to also support schema-less models such as JSON and BSON (Binary JSON), and schema-based models such as Avro and CSV.
JSON
JSON (JavaScript Object Notation) is a lightweight, text-based open standard designed for human-readable data interchange. It is used for serializing and transmitting structured data over a network connection, primarily between a server and web applications. Its main advantage is that it is simple and lean: it can be used without knowing or caring about any underlying schema.
The following is a basic JSON schema that describes a classical product catalog entry −
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Product",
  "description": "Classical product catalog",
  "type": "object",
  "properties": {
    "id": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "name": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "type": "number",
      "minimum": 0,
      "exclusiveMinimum": true
    }
  },
  "required": ["id", "name", "price"]
}
The JSON Schema has the capability to express basic definitions and constraints for data types contained in objects, and it also supports some more advanced features such as properties typed as other objects, inheritance, and links.
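To make the constraints in the product schema concrete, here is a minimal Python sketch that checks a record against them by hand. It is not a JSON Schema validator; the function name `check_product` and the sample records are hypothetical and only mirror the three rules stated in the schema (required properties, property types, and a price strictly greater than 0).

```python
def check_product(product):
    """Return a list of problems found in a product record (empty if none)."""
    problems = []

    # "required": ["id", "name", "price"]
    for key in ("id", "name", "price"):
        if key not in product:
            problems.append(f"missing required property: {key}")

    # "id" must be an integer. bool is a subclass of int in Python,
    # so it is excluded explicitly.
    if "id" in product:
        value = product["id"]
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append("id must be an integer")

    # "name" must be a string.
    if "name" in product and not isinstance(product["name"], str):
        problems.append("name must be a string")

    # "price" must be a number with minimum 0, exclusive.
    if "price" in product:
        price = product["price"]
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            problems.append("price must be a number")
        elif price <= 0:
            problems.append("price must be greater than 0")

    return problems


good = {"id": 1, "name": "Lamp", "price": 12.5}
bad = {"id": "1", "price": 0}

print(check_product(good))  # no problems expected
print(check_product(bad))   # missing name, non-integer id, price not above 0
```

In practice a dedicated JSON Schema library would do this work; the sketch only shows what the schema's rules mean for an individual record.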
Apache Avro
Avro is an Apache open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Avro is a schema-based system. A language-independent schema is associated with its read and write operations. Using Avro, big data can be exchanged between programs written in any language. Avro supports a rich set of primitive data types including numeric, binary data and strings, and a number of complex types including arrays, maps, enumerations and records. A key feature of Avro is the robust support for data schemas that change over time.
Simple Avro Schema
An Avro schema is written as a JSON document.
For example, the following schema defines a record within the "AvroSample" namespace. The record is named "Employee" and contains two fields, Name and age.
{
  "type": "record",
  "namespace": "AvroSample",
  "name": "Employee",
  "fields": [
    { "name": "Name", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}
The schema above contains four attributes, briefly described here −
type − Describes the document type, in this case "record"
namespace − Describes the namespace in which the object resides
name − Describes the schema name
fields − An array of field definitions, each of which contains the following
  name − Describes the name of the field
  type − Describes the data type of the field
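Because an Avro schema is plain JSON, its attributes can be inspected with any JSON parser. The following sketch uses only Python's standard `json` module (not an Avro library) to read the Employee schema and list its fields; the variable names are illustrative.

```python
import json

schema_text = """
{
  "type": "record",
  "namespace": "AvroSample",
  "name": "Employee",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""

schema = json.loads(schema_text)

# In Avro, a record's full name is its namespace and name joined by a dot.
full_name = f'{schema["namespace"]}.{schema["name"]}'

# Map each field name to its declared data type.
field_types = {field["name"]: field["type"] for field in schema["fields"]}

print(full_name)
print(field_types)
```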
Nested Query Language
Apache Drill supports various query languages. Its initial goal was to support the SQL-like language used by Dremel and Google BigQuery. DrQL and the Mongo query language are examples of Drill nested query languages.
DrQL
DrQL (Drill Query Language) is a SQL-like query language for nested data. It is designed to support efficient column-based processing.
Mongo Query Language
MongoDB is an open-source, cross-platform, document-oriented NoSQL database written in C++. It provides high performance, high availability, and easy scalability. MongoDB is built around the concepts of collections and documents.
A collection is a group of MongoDB documents and is the equivalent of an RDBMS table. A collection exists within a single database. A document is a set of key-value pairs.
Drill File Format
Drill supports various file formats such as CSV, TSV, PSV, JSON, and Parquet. Of these, Parquet is especially well suited to Drill: its data representation is almost identical to Drill's own, which helps Drill run queries faster.
Parquet
Parquet is a columnar storage format in the Hadoop ecosystem. Compared to a traditional row-oriented format, it is much more efficient in storage and has better query performance. Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression.
It has the following important characteristics −
- Self-describing data format
- Columnar format
- Flexible compression options
- Large file size
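The benefit of keeping a column's values adjacent can be illustrated with a small Python sketch. It pivots a few rows into columns and applies run-length encoding to one column. This is only an illustration of the idea: the sample rows are hypothetical, and the code does not reproduce Parquet's actual on-disk layout or encodings.

```python
from itertools import groupby

# Row-oriented layout: each tuple holds all the values of one record.
rows = [
    ("Alice", "NY", 21),
    ("Peter", "NY", 34),
    ("Maria", "NY", 29),
    ("Chen", "SF", 41),
]

# Column-oriented layout: one list per column, so values of the same
# column (and the same type) sit next to each other.
names, cities, ages = (list(column) for column in zip(*rows))


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_cities = run_length_encode(cities)

print(cities)
print(encoded_cities)
```

Repeated values that are scattered across records in the row layout become consecutive runs in the column layout, which is what makes simple schemes like this effective. A query that reads only one column can also skip the others entirely.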
Flat Files Format
Apache Drill can query structured file types as well as plain text files (flat files). The flat file types are −
- CSV files (comma-separated values)
- TSV files (tab-separated values)
- PSV files (pipe-separated values)
CSV file format − A CSV (comma-separated values) file stores data in a table-structured format, with fields separated (delimited) by a comma (,). The following is an example of CSV data.
firstname,age
Alice,21
Peter,34
This CSV format can be defined as follows in a Drill configuration.
"formats": {
  "csv": {
    "type": "text",
    "extensions": [ "csv2" ],
    "delimiter": ","
  }
}
With this definition, the format applies to files saved with the ".csv2" extension.
TSV file format − TSV data fields are separated (delimited) by a tab, and the files are saved with the ".tsv" extension. The following is an example of TSV data.
firstname	age
Alice	21
Peter	34
The TSV format can be defined as follows in a Drill configuration.
"formats": {
  "tsv": {
    "type": "text",
    "extensions": [ "tsv" ],
    "delimiter": "\t"
  }
}
PSV file format − PSV data fields are separated (delimited) by a pipe (|) symbol. The following is an example of PSV data.
firstname|age
Alice|21
Peter|34
The PSV format can be defined as follows in a Drill configuration.
"formats": {
  "psv": {
    "type": "text",
    "extensions": [ "tbl" ],
    "delimiter": "|"
  }
}
These PSV files are saved with the ".tbl" extension.
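The three flat file formats differ only in their delimiter. The following Python sketch shows this with the standard `csv` module: one hypothetical helper, `read_delimited`, parses the same sample data in each format. It illustrates how the delimiter setting works and is not how Drill itself reads these files.

```python
import csv
import io


def read_delimited(text, delimiter):
    """Parse delimited text whose first line is a header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


csv_text = "firstname,age\nAlice,21\nPeter,34\n"
tsv_text = "firstname\tage\nAlice\t21\nPeter\t34\n"
psv_text = "firstname|age\nAlice|21\nPeter|34\n"

csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")
psv_rows = read_delimited(psv_text, "|")

# All three inputs decode to the same records; values stay strings
# because flat files carry no type information.
print(csv_rows)
print(csv_rows == tsv_rows == psv_rows)
```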
Scalable Data Sources
Managing millions of records from multiple data sources requires a great deal of planning. When creating your data model, you need to consider key goals such as processing speed, memory usage, and scalability when handling growing volumes of data and requests.
Apache Drill provides the flexibility to immediately query complex data in native formats, such as schema-less data, nested data, and data with rapidly evolving schemas.
Following are its key benefits −
- High-performance analysis of data in its native format, including self-describing data such as Parquet, JSON files, and HBase tables.
- Direct querying of data in HBase tables without defining and maintaining a schema in the Hive metastore.
- SQL to query and work with semi-structured/nested data, such as data from NoSQL stores like MongoDB and online REST APIs.
Drill Clients
Apache Drill can be accessed through the following clients and interfaces −
- Programmatic interfaces such as JDBC, ODBC, the C++ API, and the REST API (using JSON)
- Drill shell
- Drill web console (http://localhost:8047)
- BI tools such as Tableau, MicroStrategy, etc.
- Excel
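As an illustration of the REST interface, the following Python sketch builds a JSON query request for a Drill instance. It assumes a Drillbit listening on the default web port shown above (localhost:8047), the `/query.json` endpoint of Drill's REST API, and the `cp.`employee.json`` sample file on Drill's classpath. The helper names are hypothetical, and the request is only sent if `run_query` is called against a running Drill.

```python
import json
import urllib.request

# Assumption: Drill is running locally with its web server on port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying a SQL query as JSON."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql):
    """Send the query to Drill and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(sql)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_query_request("SELECT * FROM cp.`employee.json` LIMIT 3")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `run_query(...)` with the same SQL string would submit it to the server; it is left uncalled here so the sketch runs without a Drill installation.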