Apache Drill - Introduction



In this chapter, we will discuss the basic overview of Apache Drill, its benefits, and its key features. We will also cover some background on Google Dremel.

Overview of Google Dremel/BigQuery

Google manages big data every second of every day to provide services like Search, YouTube, Gmail and Google Docs. To scan that data at high speed, Google uses a technology called “Dremel”. Dremel is a query service that allows you to run SQL-like queries against very large data sets and return accurate results in seconds.

Dremel can scan 35 billion rows without an index in tens of seconds. Dremel stores data in a columnar storage model, which means that it separates a record into column values and then stores each value on a different storage volume. Traditional databases, by contrast, normally store the whole record on one volume. This columnar approach is the main reason Dremel is so fast.
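To make the difference between the two layouts concrete, here is a minimal Python sketch. It is only a toy model of the idea, not how Dremel is implemented; the records, field names, and the `to_columns` helper are invented for illustration.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The records and field names are made up for illustration.
rows = [
    {"user": "alice", "country": "US", "clicks": 12},
    {"user": "bob", "country": "DE", "clicks": 7},
    {"user": "carol", "country": "US", "clicks": 30},
]


def to_columns(records):
    """Split each record into per-field lists (one list per column)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited, even though only "clicks" is needed.
total_from_rows = sum(record["clicks"] for record in rows)

# Columnar layout: only the "clicks" column is read; other columns stay untouched.
total_from_columns = sum(columns["clicks"])
```

Both totals are the same, but the columnar version reads a single list instead of every field of every record, which is the property that matters when a query touches only a few columns of a very wide data set.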

Google has been using Dremel in production since 2006 and has continuously evolved it for applications such as spam analysis and debugging of map tiles on Google Maps. Apache Drill is inspired by Dremel. Google later released BigQuery, the public implementation of Dremel, for general businesses and developers to use.

What is Drill?

Apache Drill is a low-latency, schema-free query engine for big data. Drill uses a JSON document model internally, which allows it to query data of any structure. Drill works with a variety of non-relational data stores, including Hadoop, NoSQL databases (MongoDB, HBase) and cloud storage such as Amazon S3 and Azure Blob Storage. Users can query the data using standard SQL and BI tools, without having to create and manage schemas.
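As an illustration of querying a file with plain SQL and no schema definition, the Python sketch below builds a request for Drill's REST interface. It assumes a Drill instance listening on localhost at port 8047 and uses the sample `employee.json` file on the classpath (`cp`) storage plugin; the `full_name` column and the `build_query_request` helper are assumptions made for this example. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumed address of a local Drill REST endpoint; adjust for your installation.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying one SQL statement."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Query a JSON file directly; no table or schema is declared beforehand.
request = build_query_request("SELECT full_name FROM cp.`employee.json` LIMIT 3")

# To run it against a live Drill instance (not executed here):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
#     print(result["rows"])
```

The SQL text refers to the file itself rather than to a predefined table, which is what "schema-free" means in practice.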

Benefits

Following are some of the most important benefits of Apache Drill −

  • Drill can scale from a single node to thousands of nodes and query petabytes of data within seconds.

  • Drill supports user-defined functions.

  • Drill's symmetrical architecture and simple installation make it easy to deploy and operate very large clusters.

  • Drill has a flexible data model and an extensible architecture.

  • Drill's columnar execution model performs SQL processing on complex data without flattening it into rows.

  • Drill supports large datasets.

Key Features

Following are some of the key features of Apache Drill −

  • Drill’s pluggable architecture enables connectivity to multiple datastores.

  • Drill has a distributed execution engine for processing queries. Users can submit requests to any node in the cluster.

  • Drill supports complex/multi-structured data types.

  • Drill uses self-describing data, where the schema is specified as a part of the data itself, so there is no need for centralized schema definitions or management.

  • Drill offers flexible deployment options: a single local node or a cluster.

  • Drill uses specialized memory management that reduces the amount of main memory a program uses or references while running and eliminates garbage collection.

  • Decentralized data management.
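The idea of self-describing data from the list above can be sketched in a few lines of Python. This is a simplified illustration, not Drill's actual schema-discovery logic; the records and the `discover_schema` helper are hypothetical.

```python
import json

# Two hypothetical JSON records with different shapes, as might sit in one file.
raw_lines = [
    '{"id": 1, "name": "sensor-a", "reading": {"temp": 21.5}}',
    '{"id": 2, "name": "sensor-b", "tags": ["roof", "north"]}',
]


def discover_schema(lines):
    """Map each top-level field to the set of Python type names seen for it."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# The field names and types come from the data itself, not from a catalog.
schema = discover_schema(raw_lines)
```

Because each record carries its own field names and value types, the reader can work out the structure while scanning, and records with different shapes can live in the same file.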

Use Cases

Apache Drill can work alongside other software. Some typical use cases are −

  • Cloud JSON and Sensor Analytics − Drill’s columnar approach can be used to access JSON data and expose it via a REST API for sensor analytics.

  • Works well with Hive − Apache Drill serves as a complement to Hive deployments with low latency queries. Drill’s Hive metastore integration exposes existing datasets at interactive speeds.

  • SQL for NoSQL − Drill’s ODBC driver and powerful parallelization capabilities provide interactive query capabilities.

Need for Drill

Apache Drill comes with a flexible JSON-like data model to natively query and process complex/multi-structured data. The data does not need to be flattened or transformed at either design time or runtime, which provides high performance for queries. Drill exposes an easy-to-use, high-performance Java API to build custom functions. Apache Drill is built to scale to big data needs and is not restricted by the memory available on the cluster nodes.
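The following Python sketch illustrates what reading nested data without flattening looks like conceptually. It does not reflect Drill's internals; the `get_path` helper and the customer records are invented for this example.

```python
def get_path(record, path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # a missing field yields None instead of an error
        value = value[key]
    return value


# Hypothetical records: nesting depth and available fields vary per record.
customers = [
    {"name": "Ann", "address": {"city": "Oslo", "zip": "0150"}},
    {"name": "Raj", "address": {"city": "Pune"}},
    {"name": "Lee"},
]

# Read a nested field in place; no flattened copy of the data is created.
cities = [get_path(customer, "address.city") for customer in customers]
```

The nested structure is navigated as it is stored, and a record that lacks the field simply contributes an empty value rather than forcing a transformation step up front.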

Drill Integration

Drill integrates with a variety of data stores, both relational and non-relational, and it is flexible enough to allow new data stores to be added.

Integration with File Systems

  • Traditional file system − Local files and NAS (Network Attached Storage)

  • Hadoop − HDFS and MapR-FS (MapR File System)

  • Cloud storage − Amazon S3, Google Cloud Storage, Azure Blob Storage

Integration with NoSQL Databases

  • MongoDB
  • HBase
  • Hive
  • MapR-DB