BigQuery - User-defined Functions



One of the advantages of BigQuery is the ability to create custom logic to manipulate data. In programming languages like Python, developers can easily write and define functions that can be utilized in multiple places within a script.

Persistent User-defined Functions in BigQuery

Many SQL dialects, including BigQuery, support these functions. BigQuery calls them persistent user-defined functions. They're either UDFs (user-defined functions) or PUDFs (persistent user-defined functions) for short.

The essence of user-defined functions can be broken into two steps −

  • Defining function logic
  • Employing a function within a script

Defining a User-defined Function

Defining a user-defined function begins with a familiar CRUD statement: CREATE OR REPLACE.

Here, instead of CREATE OR REPLACE TABLE, it is necessary to use CREATE OR REPLACE FUNCTION followed by the AS() command.

Unlike other SQL queries that can be written within BigQuery, when creating a UDF, specifying an input field and type is required.

These inputs are defined in a similar way to a Python function −

(column_name, type)

To see these steps put together, I've created a simple temporary UDF, specified by TEMP FUNCTION that parses various URLs based on user input.

Defining a User-defined Function

The steps to create the above temporary function are −

  • CREATE TEMP FUNCTION
  • Specify function name (get_sitetype)
  • Specify function input and type (inp_url, STRING)
  • Tell the function what type to return (STRING)

The REGEXP_CONTAINS() function searches for a match on strings that contain the URL strings provided. The NET.HOST() function extracts the host domain from the input URL string.

Applying this to the hacker news dataset (a BigQuery public dataset), we can generate an output that classifies the stored URLs into different categories of media −

Defining a User-defined Function

Note − Temporary functions must be immediately followed by a query.

Advertisements