BigQuery - Overview



BigQuery is Google Cloud Platform's (GCP) structured query language (SQL) engine. BigQuery allows users to query, create and manipulate datasets instantly using serverless cloud infrastructure. Consequently, students, professionals and organizations gain the ability to store and analyze data at a nearly infinite scale.

Like several other Google initiatives, BigQuery began as an in-house tool for Google developers to process and analyze large datasets. Googlers have been using BigQuery and its predecessor, Dremel, since 2006. Finding success within Google, GCP first released BigQuery as a beta in the year 2010 and then made the tool widely available in 2011.

How did BigQuery Edge Over Its Competitors?

Although there were many SQL engines and integrated development environments (IDEs) in the market at the time of BigQuery's release, BigQuery leveraged several competitive advantages, among them −

  • Pioneering a slot-based query system, which could automatically allocate compute power or "slots" based on user demand
  • Offering API integrations with a variety of programming languages from Python to JavaScript
  • Providing users the ability to train, run and deploy machine learning models written in SQL
  • Integrating seamlessly with visualization platforms like Google Data Studio (deprecated) and Looker
  • The ability to store and query ARRAY and STRUCT types, both complex SQL data types

To assist students and professionals, BigQuery maintains robust documentation about functions, public datasets and available integrations.

Google BigQuery Use Cases

Wielded properly, Google BigQuery can be used as a personal or organizational data warehouse. However, since BigQuery is SQL-oriented, it is best used with sources that feature structured, not unstructured, data.

Google's cloud architecture facilitates reliable storage and enables nearly infinite scalability. This makes BigQuery a popular enterprise choice, especially for organizations that generate or ingest large volumes of data.

To truly understand the power of BigQuery, it is helpful to understand how use cases differ between the various data team roles.

  • Data Analyst − Employs BigQuery in conjunction with data visualization tools like Tableau or Looker to create top-line reports for organizational leadership.
  • Data Scientist − Leverages BigQuery's machine learning capabilities, Big ML, to create and implement ML models.
  • Data Engineer − Uses BigQuery as a data warehouse and builds tools like views and user defined functions to empower end users to discover insights in cleaned, preprocessed datasets.

When used in tandem, BigQuery presents an opportunity to optimize storage and generate nuanced insights to increase growth and revenue.

Loading Data in BigQuery

Loading data in BigQuery is simple and intuitive with a variety of options available to cloud developers. BigQuery can ingest data in many forms including −

  • CSV
  • JSON
  • Parquet

BigQuery can also sync with Google integrations like Google Sheets to create live connected tables. And BigQuery can load files stored in cloud storage repositories like Google Cloud Storage.

Developers can also leverage the following methods to cleanly load data into BigQuery −

  • BigQuery Transfers, a code-less data pipeline that integrates with GCP data like Google Analytics.
  • SQL-based pipelines that feature CRUD statements to CREATE, DELETE or DROP tables.
  • Pipelines that utilize the BigQuery API to conduct batch or streaming load jobs.
  • Third-party pipelines like Fivetran that offer BigQuery connections.

Routinely loading data from the above sources enables downstream stakeholders to have reliable access to timely, accurate data.

BigQuery Querying Basics

While it is helpful to understand BigQuery use-cases and how to properly load data, beginners learning BigQuery will most likely start at the query layer.

Table Reference Convention in BigQuery

Before writing your first SELECT statement, it is helpful to understand table reference convention in BigQuery.

BigQuery syntax differs from other SQL dialects because proper BigQuery SQL requires developers to enclose table references in single quotes like this − ' '.

A table reference involves three elements −

  • Project
  • Dataset
  • Table

Put together, a BigQuery table looks like this when referenced in a SQL query or script −

'my_project.dataset.table'

A basic BigQuery query looks like this −

SELECT * FROM 'my_project.dataset.table.'

Legacy SQL and Standard SQL

An important distinction between BigQuery SQL and other types of SQL is that BigQuery offers support for two classifications of SQL: Legacy and Standard.

  • Legacy SQL enables users to use older, possibly deprecated SQL functions.
  • Standard SQL represents a more updated interpretation of SQL.

For our purposes, we'll be using the standard SQL designation.

Points to Note When Writing a Query in BigQuery

Unlike other SQL environments, when you write a query in BigQuery, the UI will automatically tell you whether your query will run and how much data it will process. Note that there is little definitive correlation between volume of data processed and projected execution time.

  • Once executed, your query will return a result visible in the UI, which is similar to viewing a spreadsheet or a Pandas data frame.
  • You also have the option to download your result as either a CSV or JSON file, which both provide a simple way to save data to work with later.
  • If you need to save query text to run or edit later, BigQuery also provides an option to save queries with version control to help track changes.

Running BigQuery at the Enterprise Level

When running queries in an organization with several BigQuery users, it is important to keep in mind several factors to ensure your queries run without interruption or without causing congestion for other users −

  • Avoid executing queries at peak usage times.
  • Whenever possible, use a WHERE clause to narrow the amount of data processed.
  • For large tables, create views users can more easily and efficiently query.

Transitioning from learning BigQuery to running queries at the enterprise level requires time to develop an understanding of variables particular to your organization like slot usage and compute resource consumption.

First, however, it is necessary to develop a deeper understanding of constructing and optimizing queries in BigQuery.

Advertisements