BigQuery - Datasets

What are Datasets in BigQuery?

Datasets are entities that live within a project. Datasets act as a container for BigQuery tables as well as views, routines and machine learning models.

Tables cannot exist outside a dataset, so you must create a dataset (or choose an existing one) before adding a new data source in BigQuery Studio.

In addition to attributes like a human-readable name, developers are required to specify a location when creating a dataset. These locations correspond to the physical locations of Google data centers throughout the world.

When specifying a location, you choose either a single region or a multi-region. Locations are given as identifiers rather than city names. For instance, instead of naming a data center in Iowa, you would specify "us-central1".

Creating a dataset in a multi-region has an added advantage: BigQuery can place and process the data in any region within that larger geographic area, for example when one region does not have the resources to keep up with current demand. The current multi-regions are US (United States) and EU (European Union).
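
The distinction between the two location types can be illustrated with a small helper. This is only a sketch: the function name `location_type` is hypothetical, and it assumes that the two multi-region names mentioned above (US and EU) are the only ones, treating every other value as a single region.

```python
# Multi-region names as listed in the text; any other value is assumed
# to be a single-region identifier such as "us-central1".
MULTI_REGIONS = {"US", "EU"}


def location_type(location: str) -> str:
    """Classify a BigQuery location string as 'multi-region' or 'region'."""
    normalized = location.strip().upper()
    return "multi-region" if normalized in MULTI_REGIONS else "region"


for name in ("US", "EU", "us-central1"):
    print(name, "is a", location_type(name))
```

The check is case-insensitive because location names are commonly written both as "US" and "us".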

Steps to Create a Dataset in BigQuery

To create a dataset, follow the steps given below. First, navigate to your project name and click the three dots next to it, which opens a menu containing "Create dataset".

Once you click "create dataset", you'll be prompted to enter −

A dataset_id (the name of the dataset).
A location type (region vs. multi-region).
A default table expiration (optional; how many days after creation each new table in the dataset is deleted).

The end result is a dataset which serves as a container for future tables, views and materialized views.
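
The same three inputs can also be expressed as a SQL statement instead of console clicks. The sketch below is an alternative to the steps above, not the method the steps describe: it builds a `CREATE SCHEMA` statement (BigQuery's DDL for datasets) as a string. The helper name `create_dataset_ddl` and the example project and dataset names are hypothetical.

```python
from typing import Optional


def create_dataset_ddl(project: str, dataset_id: str, location: str,
                       expiration_days: Optional[float] = None) -> str:
    """Build a CREATE SCHEMA statement from the values the console prompts for."""
    options = [f'location = "{location}"']
    # The default table expiration is optional, so only emit it when given.
    if expiration_days is not None:
        options.append(f"default_table_expiration_days = {expiration_days}")
    return (
        f"CREATE SCHEMA IF NOT EXISTS `{project}`.{dataset_id}\n"
        f"OPTIONS ({', '.join(options)})"
    )


# Hypothetical project and dataset names, for illustration only.
print(create_dataset_ddl("my-project", "sales_raw", "US", expiration_days=30))
```

The resulting statement could be pasted into the BigQuery Studio query editor. `IF NOT EXISTS` is included so that running it twice does not raise an error.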

A "Sharing" option lets developers manage access to a dataset so that only authorized users can view or change it.
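
Access can also be granted with a SQL statement rather than through the "Sharing" dialog. As a rough sketch, the helper below builds a `GRANT` statement that gives one user read access to a dataset. The function name, the choice of the `roles/bigquery.dataViewer` role, and the example email address are assumptions made for illustration.

```python
def grant_viewer_sql(project: str, dataset_id: str, member: str) -> str:
    """Build a GRANT statement giving `member` read access to a dataset.

    `member` is expected in the "user:someone@example.com" form.
    """
    return (
        "GRANT `roles/bigquery.dataViewer`\n"
        f"ON SCHEMA `{project}`.{dataset_id}\n"
        f'TO "{member}"'
    )


# Hypothetical project, dataset, and user, for illustration only.
print(grant_viewer_sql("my-project", "sales_raw", "user:analyst@example.com"))
```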

BigQuery: Public Datasets

If you're new to BigQuery and, possibly, to SQL in general, you may not yet have any data of your own to store and manipulate. This is where BigQuery Studio is useful as a SQL sandbox. In addition to serverless infrastructure, BigQuery provides terabytes of sample data that students and professionals can use to learn and refine their SQL skills.

Published through the Google Cloud Public Dataset Program, BigQuery public datasets are stored in their own universally-accessible project: bigquery-public-data.
Developers can query up to 1 terabyte of data per month for free under the on-demand (pay-per-terabyte) pricing model; data scanned beyond that allowance is billed.
Unlike many stock datasets, the data in these tables is realistic, meaning "messy", and at times requires significant transformation to yield actionable insights.
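
The free allowance can be turned into a quick estimate of what a month of querying might cost. This is a minimal sketch under stated assumptions: the allowance is treated as 1 TiB (2**40 bytes), and the price per TiB is passed in by the caller because actual rates vary by region and change over time. The function names are hypothetical.

```python
FREE_TIB_PER_MONTH = 1.0   # free monthly query allowance, assumed to be 1 TiB
BYTES_PER_TIB = 2 ** 40


def billable_tib(bytes_scanned_this_month: int) -> float:
    """Return how many TiB exceed the free monthly allowance."""
    scanned_tib = bytes_scanned_this_month / BYTES_PER_TIB
    return max(0.0, scanned_tib - FREE_TIB_PER_MONTH)


def estimated_cost(bytes_scanned_this_month: int, price_per_tib: float) -> float:
    """Estimate the on-demand query cost; price_per_tib is supplied by the caller."""
    return billable_tib(bytes_scanned_this_month) * price_per_tib


# Half a TiB scanned stays inside the free allowance.
print(billable_tib(2 ** 39))
# Three TiB scanned leaves two billable TiB at a caller-supplied price.
print(estimated_cost(3 * 2 ** 40, price_per_tib=5.0))
```

The price of 5.0 in the usage line is a placeholder, not a quoted rate.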

BigQuery also provides several sample tables, separate from the larger public datasets, in the samples dataset of the same project (bigquery-public-data.samples) −

gsod
github_nested
github_timeline
natality
shakespeare
trigrams
wikipedia
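
To show how one of these sample tables is addressed, the sketch below builds the fully qualified table name (project, dataset, table) and a small aggregate query against shakespeare. The helper names are hypothetical, and the column names `word` and `word_count` are assumptions about that table's schema that should be checked in BigQuery Studio before running the query.

```python
PUBLIC_PROJECT = "bigquery-public-data"


def sample_table(table: str, dataset: str = "samples") -> str:
    """Return a backtick-quoted, fully qualified table reference."""
    return f"`{PUBLIC_PROJECT}.{dataset}.{table}`"


def top_words_query(limit: int = 10) -> str:
    """Build a query for the most frequent words in the shakespeare table.

    Assumes the table has `word` and `word_count` columns.
    """
    return (
        "SELECT word, SUM(word_count) AS total\n"
        f"FROM {sample_table('shakespeare')}\n"
        "GROUP BY word\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )


print(top_words_query(5))
```

The backticks matter here because the project name contains hyphens, which are not valid in an unquoted identifier.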

Perhaps the most significant advantage of BigQuery public datasets is that the data comes from real sources such as the BBC, Hacker News and Johns Hopkins University.
