What is Dremel?



BigQuery is not a limitless tool, and it is best not to treat it as a proverbial black box. To deepen your understanding, it helps to get a bit "under the hood" and examine some of the inner workings of BigQuery's engine.

Google's Dremel: A Distributed Computing Framework

BigQuery is based on a distributed computing framework called Dremel, which Google explained in greater depth in a 2010 whitepaper: "Dremel: Interactive Analysis of Web-Scale Datasets."

The whitepaper described a vision for many of the core characteristics that define modern BigQuery, such as an ad-hoc query system, nearly limitless compute power, and an emphasis on processing big data (terabytes and petabytes).

How does Dremel Work?

Since Dremel began as an internal product (used within Google since 2006), it combines aspects of web search and parallel database management systems.

To execute a query, Dremel uses a "tree-like" structure: the query is broken into stages that run in sequence, while the work within each stage is distributed across many servers and executed in parallel. In effect, Dremel turns a SQL query into an execution tree.
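The execution tree can be sketched with a toy aggregation: leaf nodes each scan their own shard of the data in parallel, and a merging node combines the partial results. This is an illustrative model only, not BigQuery's actual implementation; the function names (`leaf_scan`, `mixer`, `execute_sum`) are invented for the sketch.

```python
# Toy model of Dremel's execution tree for a SUM query:
# leaf servers scan disjoint shards in parallel, then a root
# "mixer" merges the partial sums into the final result.
# Illustrative only -- not BigQuery's real implementation.
from concurrent.futures import ThreadPoolExecutor

def leaf_scan(shard):
    """A leaf server aggregates its own shard of the data."""
    return sum(shard)

def mixer(partials):
    """A merging node combines partial results from its children."""
    return sum(partials)

def execute_sum(shards):
    # Stage 1: all leaves run in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(leaf_scan, shards))
    # Stage 2: the root merges the partial sums.
    return mixer(partials)

shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(execute_sum(shards))  # 45
```

The two stages run in sequence, but within the first stage every leaf works at the same time, which is what makes scans over very large tables fast.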

Slots: Foundational Unit of BigQuery Execution

Nestled under the "Query Execution" header, the authors describe the foundational unit of BigQuery execution: the slot.

  • A slot is an abstraction that represents an available processing unit.
  • Slots are finite, which is why slowdowns within a project are often due to a lack of available slots.
  • Since slot usage varies with many factors, such as the volume of data processed and the time of day, a query that ran quickly earlier in the day may now take several minutes.
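The points above can be made concrete with a toy capacity model: given a fixed pool of slots shared evenly among concurrent queries, the same query finishes later when more work is competing for the pool. All numbers here are invented for illustration and do not reflect real BigQuery scheduling.

```python
# Toy model: a project has a fixed pool of slots; a query's
# wall-clock time grows when concurrent queries compete for them.
# Numbers and the even-sharing assumption are illustrative only.
def estimated_runtime(slot_seconds_needed, pool_size, concurrent_queries):
    """Assume slots are shared evenly among concurrent queries."""
    slots_per_query = pool_size / concurrent_queries
    return slot_seconds_needed / slots_per_query

# The same query (100,000 slot-seconds of work) at a quiet vs busy time:
quiet = estimated_runtime(100_000, pool_size=2000, concurrent_queries=1)
busy = estimated_runtime(100_000, pool_size=2000, concurrent_queries=20)
print(quiet, busy)  # 50.0 vs 1000.0 (seconds)
```

The query's total work is unchanged; only the share of the slot pool it receives differs, which is exactly why identical queries can have very different runtimes at different times of day.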

The abstraction of slots is perhaps the most applicable concept expressed in the Dremel paper; the other information is helpful to know but mostly describes earlier iterations of the BigQuery product.

BigQuery: Pricing & Usage Models

Whether you're a student practicing your first BigQuery queries or a high-powered decision maker, understanding pricing is critical in defining the limits of what you can store, access and manipulate within BigQuery.

To thoroughly understand BigQuery pricing, it is best to divide the costs into two buckets:

  • Usage (BigQuery's documentation calls this "compute")
  • Storage

Usage covers nearly any SQL activity you can think of, from running a simple SELECT query to deploying ML models or writing complex user-defined functions.

For any usage-related activities, BigQuery offers the following choices:

  • A pay-as-you-go or "on-demand" model.
  • A bulk-slot or "capacity" model in which clients pay per slot-hour.

Which Pricing Model is Best for You?

When it comes to deciding between the two pricing models, it is important to consider the following factors:

  • Volume of data queried
  • Volume of user traffic incurred

The "on-demand" model is priced per terabyte, which means that for users with many large (multiple terabyte) tables this could be an intuitive and convenient way to track expenses.

The "capacity" or slot model is helpful for organizations or individuals that are evolving their data infrastructure and may not have a fixed amount of data that would help them calculate a reliable per-month rate. Instead of worrying about how much data each resource generates, the problem shifts to refining best practices to allocate querying time to both scheduled processes and individual, ad-hoc queries.

In essence, the slot model follows the framework established by the Dremel project, in which slots (units of processing capacity) are reserved and priced accordingly.
