Getting Started with Google BigQuery

Overview

This page covers the basics of working with Google BigQuery.

Prerequisites

Google Cloud Project Metadata

Projects using Google BigQuery require access credentials.

You must create a Google Cloud Project Metadata and configure it with Google Cloud Platform credentials to connect to Google BigQuery.

Account Permissions

For your Google BigQuery project to work correctly, the account you specify in the Google Cloud Project Metadata needs the bigquery.jobs.create permission. This permission is set within Google Cloud Platform.

Google Cloud Platform has three predefined Identity and Access Management (IAM) roles that include the permissions you need to run a job:

  • roles/bigquery.user

  • roles/bigquery.jobUser

  • roles/bigquery.admin

For more information on permissions, as well as custom IAM roles, see Introduction to BigQuery jobs on the Google Cloud website.

Google Cloud Storage Metadata

Google BigQuery uses Google Cloud Storage as a place to store temporary files and optimize data loading.

We recommend defining this temporary location by creating a Google Cloud Storage Metadata.

Connect to your Data

Create the Metadata

To create a Google BigQuery Metadata, launch the Metadata creation wizard, select the Google BigQuery Metadata in the list and follow the wizard.

The wizard will ask you to choose the credentials to use, with a list of all credentials defined in your workspace. If the list is empty, make sure the prerequisites have been set up correctly.

[Screenshot: Metadata wizard, credentials selection]

Select the credentials and click Next.

You can customize the URL with optional parameters. Afterwards, click Connect.

[Screenshot: Metadata wizard, connection]

On the next page, click Refresh Values on the Catalog Name, then select the Google Project from the list.

Click Refresh Values on the Schema Name and select the Google BigQuery dataset to reverse-engineer from the list.

[Screenshot: Metadata wizard, catalog and schema selection]

Click Next, then click Refresh to list the tables, and select the ones to reverse-engineer.

Finally, click the Finish button. The tables will be reverse-engineered in the Metadata.

JDBC parameters

The JDBC URL to Google BigQuery supports optional parameters in standard URL query string syntax, as in the following example:

jdbc:semarchy:bigquery?backendErrorRetryNumber=2&backendErrorRetryDelay=1000

The following parameters are supported:

  • backendErrorRetryNumber: The number of times to retry a query when a jobBackendError occurs. If the error persists after these retries, the job stops and displays the corresponding message. An undefined parameter is equivalent to no retries.

  • backendErrorRetryDelay: The delay, in milliseconds, to wait between retries.

  • internalErrorRetryNumber: The number of times to retry a query when an internalError or jobInternalError occurs. If the error persists after these retries, the job stops and displays the corresponding message. An undefined parameter is equivalent to no retries.

  • internalErrorRetryDelay: The delay, in milliseconds, to wait between retries.

  • tableUnavailableErrorRetryNumber: The number of times to retry a query when a tableUnavailable error occurs. If the error persists after these retries, the job stops and displays the corresponding message. An undefined parameter is equivalent to no retries.

  • tableUnavailableErrorRetryDelay: The delay, in milliseconds, to wait between retries.

  • quotaExceededErrorRetryNumber: The number of times to retry a query when a quotaExceeded error occurs. If the error persists after these retries, the job stops and displays the corresponding message. An undefined parameter is equivalent to no retries.

  • quotaExceededErrorRetryDelay: The delay, in milliseconds, to wait between retries.

  • jobsLocation: The location used for job operations, for example us or eu.

  • connectionTimeout: The timeout, in milliseconds, to establish a connection.

  • readTimeout: The timeout, in milliseconds, to read data from an established connection.
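All the retry parameter pairs share the same semantics: a number of retries and a delay between them. The behavior can be pictured with the following sketch, which is purely illustrative and not the driver's actual implementation; run_with_retries, retry_number, and retry_delay are hypothetical names standing in for any of the *RetryNumber/*RetryDelay pairs:

```python
import time

def run_with_retries(query_fn, retry_number=0, retry_delay=0):
    """Run query_fn, retrying on failure (illustrative sketch only).

    retry_number: extra attempts after the first one; 0 (the
    "undefined" case) means no retries.
    retry_delay: wait between attempts, in milliseconds.
    """
    attempts = retry_number + 1  # the initial attempt plus the retries
    for attempt in range(attempts):
        try:
            return query_fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # error persisted after all retries: stop the job
            time.sleep(retry_delay / 1000.0)
```

With backendErrorRetryNumber=2 and backendErrorRetryDelay=1000, for instance, a query that keeps failing would be attempted three times in total, with a one-second pause between attempts.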

You can also add the parameters as standalone properties under the top-level BigQuery metadata node.
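Because the parameters follow standard URL query string syntax, the JDBC URL can also be assembled programmatically. A minimal sketch, where the parameter names come from the list above and the values are arbitrary:

```python
from urllib.parse import urlencode

# Hypothetical values; the parameter names are the documented ones.
params = {
    "backendErrorRetryNumber": 2,
    "backendErrorRetryDelay": 1000,
    "jobsLocation": "eu",
}
url = "jdbc:semarchy:bigquery?" + urlencode(params)
print(url)
# jdbc:semarchy:bigquery?backendErrorRetryNumber=2&backendErrorRetryDelay=1000&jobsLocation=eu
```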

For performance purposes, Semarchy xDI uses Cloud Storage to optimize data loading into Google BigQuery.

Drag-and-drop or select your previously created Google Cloud Storage Metadata inside the related property.

Choose a bucket or a folder, depending on your preferred organization.

[Screenshot: Metadata, Cloud Storage link]

This bucket or folder is used as the temporary location, when necessary, to optimize data loading into Google BigQuery.

Create your first Mappings

Below are some examples of how Google BigQuery is used in Mappings and Processes.

Example of a Mapping loading data from an HSQL database to a Google BigQuery table:

[Screenshot: Mapping example 1]

Example of a Mapping loading data from multiple Google BigQuery tables, with joins, to an HSQL table:

[Screenshot: Mapping example 2]

Cloud Storage Mode

When integrating data into Google BigQuery, data may go through Google Cloud Storage for performance purposes.

Different methods are available in the Templates to improve performance, depending on factors such as the amount of data sent and network quality:

  • stream: Data is streamed directly into the Google Cloud Storage bucket.

  • localfile: Data is first exported to a local temporary file, which is then sent to the defined Google Cloud Storage bucket. Prefer this method for large sets of data.
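To make the localfile method concrete, it can be pictured as a two-step staging process: export to a local temporary file, then upload. The sketch below is purely illustrative; stage_rows_as_local_file is a hypothetical helper, and the actual upload to the Cloud Storage bucket (performed by the Template) is left out:

```python
import csv
import tempfile
from pathlib import Path

def stage_rows_as_local_file(rows, header):
    """Write rows to a local temporary CSV file (the 'localfile' idea).

    The real Template would then send this file to the configured
    Google Cloud Storage bucket; this sketch stops at the local file.
    """
    tmp = tempfile.NamedTemporaryFile(
        mode="w", newline="", suffix=".csv", delete=False
    )
    with tmp as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return Path(tmp.name)  # path that would be uploaded to the bucket

staged = stage_rows_as_local_file([(1, "a"), (2, "b")], ["id", "name"])
```

Staging through a file like this avoids holding the whole data set in memory, which is why the localfile method is preferred for large volumes.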

The storage method is defined on the Template:

[Screenshot: Template storage method]

Sample Project

The Google BigQuery Component ships with sample projects that contain various examples and use cases.

Have a look at these projects for examples describing how to use the Component.

Refer to Install Components to learn how to import sample projects.