BigQuery - Integrate Airflow

Running a Python script to load a BigQuery table works well for an individual job. However, when a developer needs to run several sequential tasks, isolated scripts are no longer optimal; it becomes necessary to think beyond simple execution. Orchestration is required.

BigQuery can be integrated with several popular orchestration tools, such as Apache Airflow and dbt. However, this tutorial will highlight Airflow.

Directed Acyclic Graphs (DAGs)

Apache Airflow allows developers to create execution blocks called Directed Acyclic Graphs (DAGs). Each DAG is composed of one or more tasks.
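
A minimal sketch of a DAG, assuming Airflow 2.x (the DAG id, schedule, and task names below are placeholders) −

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A DAG with two placeholder tasks that run in sequence.
with DAG(
    dag_id="daily_bigquery_load",      # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check = EmptyOperator(task_id="check_source")
    load = EmptyOperator(task_id="load_table")

    # ">>" declares the dependency: check_source runs before load_table.
    check >> load

In a real pipeline, the EmptyOperator placeholders are replaced with operators that do actual work, such as the BigQuery operators described below.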

Each task requires an operator. There are two important BigQuery-compatible operators −

  • BigQueryCheckOperator
  • BigQueryExecuteQueryOperator

BigQueryCheckOperator

The BigQueryCheckOperator allows developers to run upstream checks that determine whether the data has been updated for the day.

If the table's schema does not include an upload timestamp, it is possible to query table metadata instead (as discussed earlier).

Developers can determine the time a table was last updated by running a query against the table metadata.
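
One plausible version of the query uses the dataset's __TABLES__ metadata view, wrapped directly in a BigQueryCheckOperator; the project, dataset, and table names below are placeholders −

from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

# Fails the task, and therefore blocks downstream tasks, unless the
# table was modified on or after the DAG run's logical date ({{ ds }}).
check_freshness = BigQueryCheckOperator(
    task_id="check_table_freshness",
    sql="""
        SELECT COUNT(*) > 0
        FROM `my-project.my_dataset.__TABLES__`
        WHERE table_id = 'my_table'
          AND TIMESTAMP_MILLIS(last_modified_time) >= TIMESTAMP('{{ ds }}')
    """,
    use_legacy_sql=False,
)

The operator evaluates the first row returned by the query; if any value in that row is false, the check fails and downstream tasks do not run.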


BigQueryExecuteQueryOperator

To execute SQL scripts that depend on upstream data, SQL developers can use the BigQueryExecuteQueryOperator to create a load job.
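
As a sketch (the query, column names, and destination table below are placeholders), pairing a query with a destination table and a write disposition turns it into a load job −

from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryExecuteQueryOperator,
)

# Runs the SQL and writes the result into the destination table.
load_daily_summary = BigQueryExecuteQueryOperator(
    task_id="load_daily_summary",
    sql="""
        SELECT order_date, SUM(amount) AS total_amount
        FROM `my-project.my_dataset.orders`
        WHERE order_date = DATE('{{ ds }}')
        GROUP BY order_date
    """,
    destination_dataset_table="my-project.my_dataset.daily_summary",
    write_disposition="WRITE_APPEND",    # append each run's results
    use_legacy_sql=False,
)

# Chain it after the upstream check so the load only runs on fresh data:
# check_freshness >> load_daily_summary

In newer versions of the Google provider, BigQueryInsertJobOperator supersedes BigQueryExecuteQueryOperator, but the pattern is the same.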


A deeper explanation of Airflow is beyond the scope of this tutorial, but GCP provides extensive documentation for those who would like to learn more.
