
What is Dremel?
BigQuery is not a limitless tool. In fact, it is better to avoid thinking of BigQuery as a proverbial black box. To deepen your understanding, it's necessary to get a bit "under the hood" and examine some inner workings of BigQuery's engine.
Google's Dremel: A Distributed Computing Framework
BigQuery is based on a distributed computing framework called Dremel, which Google explained in greater depth in a 2010 whitepaper: "Dremel: Interactive Analysis of Web-Scale Datasets."
The whitepaper described a vision for many of the core characteristics that define modern BigQuery such as an ad-hoc query system, nearly limitless compute power, and an emphasis on processing big data (terabytes & petabytes).
How does Dremel Work?
Dremel began as an internal Google product (in use within Google since 2006), and it combines aspects of web search with those of a parallel database management system.
To execute queries, Dremel turns each SQL query into an execution tree: a "tree-like" structure in which the stages of a query run as sequential steps, while the work within each stage, and across multiple queries, runs in parallel.
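The execution tree can be sketched conceptually: leaf workers scan data shards in parallel and intermediate levels combine partial results until a single answer reaches the root. This is an illustrative model only, not BigQuery's actual internals; the function names and fan-in are assumptions.

```python
# Conceptual sketch of Dremel-style tree execution (illustrative only;
# names and structure are assumptions, not BigQuery internals).
from concurrent.futures import ThreadPoolExecutor

def leaf_scan(shard):
    # Leaf servers scan their shard and compute a partial result,
    # e.g. a partial SUM for an aggregate query.
    return sum(shard)

def execute_tree(shards, fan_in=2):
    # Leaves run in parallel, like leaf servers reading storage.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(leaf_scan, shards))
    # Intermediate "mixer" levels merge partial results in groups of
    # `fan_in` until only the root's final value remains.
    while len(partials) > 1:
        partials = [sum(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]

shards = [[1, 2, 3], [4, 5], [6], [7, 8, 9]]
print(execute_tree(shards))  # 45
```

The key property the sketch shows is that each level only ever combines a handful of already-reduced values, which is what lets the real system scale the leaf level out to thousands of workers.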
Slots: Foundational Unit of BigQuery Execution
Nestled under the "Query Execution" header, the authors describe the foundational unit of BigQuery execution: the slot.
- A slot is an abstraction that represents an available processing unit.
- Slots are finite, which is why any logjams within a project are often due to a lack of slot availability.
- Slot usage varies with many factors, such as the volume of data processed and the time of day, so a query that executed quickly earlier in the day may later take several minutes.
The abstraction of slots is perhaps the most applicable concept expressed in the Dremel paper; the other information is helpful to know but mostly describes earlier iterations of the BigQuery product.
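The effect of finite slots can be sketched with simple arithmetic: when a query needs more units of work than there are free slots, the work is queued in waves, so the same query slows down when the project is busy. The numbers below are hypothetical, purely to illustrate the scaling.

```python
import math

def estimated_duration(work_units, available_slots, unit_seconds=1.0):
    # With fewer free slots than work units, work queues up in waves;
    # each wave runs `available_slots` units in parallel.
    waves = math.ceil(work_units / available_slots)
    return waves * unit_seconds

# The same hypothetical 400-unit query at a quiet vs. a busy time of day:
print(estimated_duration(400, available_slots=200))  # 2.0 (seconds)
print(estimated_duration(400, available_slots=50))   # 8.0 (seconds)
```

This is why the "logjams" mentioned above show up as slow queries rather than errors: the work still completes, just in more waves.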
BigQuery: Pricing & Usage Models
Whether you're a student practicing your first BigQuery queries or a high-powered decision maker, understanding pricing is critical in defining the limits of what you can store, access and manipulate within BigQuery.
To thoroughly understand BigQuery pricing, it is best to divide the costs into two buckets −
- Usage (BigQuery's documentation calls this "compute")
- Storage
Usage covers nearly any SQL activity you can think of, from running a simple SELECT query to deploying ML models or writing complex user-defined functions.
For any usage-related activities, BigQuery offers the following choices −
- A pay-as-you-go or "on-demand" model.
- A bulk slot or "capacity" model in which clients pay per slot hour.
Which Pricing Model is best for you?
When it comes to deciding between the two pricing models, it is important to consider the following factors −
- Volume of data queried
- Volume of user traffic incurred
The "on-demand" model is priced per terabyte of data scanned, which means that for users with many large (multiple-terabyte) tables this can be an intuitive and convenient way to track expenses.
The "capacity" or slot model is helpful for organizations or individuals that are evolving their data infrastructure and may not have a fixed amount of data that would help them calculate a reliable per-month rate. Instead of worrying about how much data each resource generates, the problem shifts to refining best practices to allocate querying time to both scheduled processes and individual, ad-hoc queries.
In essence, the slot model follows the framework established by the Dremel project, in which slots (servers) are reserved and priced accordingly.
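A back-of-the-envelope comparison makes the trade-off concrete. The rates below are illustrative placeholders, not current BigQuery prices; check the pricing documentation for your region before relying on any figure.

```python
def on_demand_cost(tb_scanned_per_month, price_per_tb=6.25):
    # "On-demand": pay per terabyte of data scanned.
    # price_per_tb is a placeholder rate for illustration.
    return tb_scanned_per_month * price_per_tb

def capacity_cost(slots, hours_per_month, price_per_slot_hour=0.06):
    # "Capacity": pay per slot-hour, regardless of bytes scanned.
    # price_per_slot_hour is likewise a placeholder rate.
    return slots * hours_per_month * price_per_slot_hour

# A hypothetical team scanning ~500 TB/month, vs. reserving
# 100 slots around the clock (24 h x 30 days):
print(on_demand_cost(500))          # 3125.0
print(capacity_cost(100, 24 * 30))  # 4320.0
```

Under these assumed rates, the on-demand model wins for this workload, but the break-even point shifts quickly as scan volume grows while the slot reservation stays fixed, which is exactly the predictability the capacity model is buying.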