BigQuery - Basic Schema Design



Unlike an Excel spreadsheet, SQL tables do not automatically accept data as it is presented. The data source and table must be in agreement about the data's scope and types before ingestion can be successful. Types must be both consistent and in a format that BigQuery can parse and make available within BigQuery Studio.

What is a Schema?

To do this, developers must provide a schema. Essentially, a schema is an ordered list of attributes and their corresponding types.

In BigQuery, column order and the number of columns matters, so any supplied schema must match that of the source table.

Three Ways to Specify a Schema

In BigQuery, there are three ways to specify a schema −

  • Create a schema within the UI during the "create table" step.
  • Write or upload a schema as a JSON text file.
  • Tell BigQuery to auto-infer the schema.

Auto-infer the Schema

While auto inferring a schema is the least work for a developer, this can also introduce the most risk into a data pipeline.

Even if data types are consistent on one run, they could change unexpectedly. Without a fixed schema, BigQuery has to "figure out" which data type to accept, which can lead to schema mismatch errors.

The UI Method of Creating a Schema

Since the UI method of creating a schema is rather intuitive, the next part will focus on creating a schema as a JSON file.

Creating a Schema as a JSON File

The format for JSON schema is a dictionary "{ }" inside of a list "[ ]." Each field can have three attributes −

  • Field Name
  • Column Type
  • Column Mode

The default column mode is "NULLABLE" which means the column accepts NULL values. Other column modes will be covered during the discussion of nested data types.

An example of one line of a JSON schema would be −

{"name": "id", "type": "STRING", "mode": "NULLABLE"}

If you're simply adding a column or altering the type of an existing column, you can generate the schema of an existing table with this query −

[Generate schema query]

Just make sure results are set to "JSON" to copy/download the resulting JSON file.

GCP Cloud Shell: Create a Table

Cloud Shell is Google Cloud Platform's command line interface (CLI) tool that allows users to interact with data sources directly from a terminal window. Just as it's possible to create a table using the BigQuery UI in the GCP Console, it is possible to quickly create a table using a Linux-like syntax via the CLI.

Unlike provisioning a CLI on a local machine, as long as you're logged into a Google account, you're automatically logged into the Cloud Shell and, consequently, can interact with BigQuery resources in the terminal. It is also possible (but more complex) to provision the gcloud CLI in a local IDE.

The "bq" Command-line

In either case, the BigQuery cloud shell integration hinges on one command: bq. The bq command-line is a Python-based command line tool compatible with cloud shell.

To create a table, it is necessary to use "bq" in combination with "mk"

--bq mk 

This syntax is used in combination with the "table" or "-t" flag. There is also the option to specify multiple parameters, just like when creating a table within the BigQuery UI.

Available parameters include −

  • Expiration rules (expiration time in seconds)
  • Description
  • Label
  • Add tags (policy tags)
  • Project id
  • Dataset id
  • Table name
  • Schema

Here is an example

Note − An inline schema was provided.

Bq mk -t sample_dataset.bikeshare_table_cli
name:STRING,station_id:STRING,modified_date:TIMESTAMP

After successful execution, you will get an output like this −

BigQuery Basic Schema Design

There is no performance advantage when choosing the cloud shell over the UI; it simply comes down to user preference. However, creating a table in this way can be useful when it comes to creating recurring or automated processes.

Advertisements