Learn more about Google BigQuery Export Integration.

The integration for Google BigQuery enables the import of data from your BigQuery tables or from query results into Treasure Data.

You can use the same BigQuery connection for both import and export, but you cannot use connections authenticated by OAuth for export.


Prerequisites

  • Basic knowledge of Treasure Data

  • Basic knowledge of Google Cloud Platform (BigQuery, Cloud Storage and IAM)

Limitations

  • OAuth is no longer supported for this connector. Only JSON keyfile is supported.
  • If your dataset is in a location other than the US or EU multi-region, you need to specify the location. Otherwise, your job in TD will fail with an error: Cannot find job_id xxxxx.

Permissions and Roles

To use this data connector, the authorized account (service account) must have the following permissions or IAM roles.

To use table loading

  • Required permissions: bigquery.tables.get, bigquery.tables.getData

  • Minimum IAM role: BigQuery Data Viewer

To use query loading

  • Required permissions: bigquery.jobs.create

  • Minimum IAM role: BigQuery Job User

To use "Import Large Dataset"

  • Required permissions: bigquery.tables.export, bigquery.tables.delete, storage.buckets.get, storage.objects.list, storage.objects.create, storage.objects.delete

  • Minimum IAM roles: BigQuery Data Editor, Storage Legacy Bucket Writer

For more details about IAM permissions and roles, see the Google Cloud documents: BigQuery and Cloud Storage.
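
For example, you can grant these roles to a service account with the gcloud and gsutil command-line tools. This is a minimal sketch; "my-project", "my-bucket", and the service account email are placeholders, and the last command is only needed if you use "Import Large Dataset".

# Placeholders: replace my-project, my-bucket, and the service account email with your own values.
$ gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:td-bigquery@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataViewer"
$ gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:td-bigquery@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"
$ gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:td-bigquery@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"
# "Storage Legacy Bucket Writer" is granted on the GCS bucket rather than the project.
$ gsutil iam ch \
    "serviceAccount:td-bigquery@my-project.iam.gserviceaccount.com:roles/storage.legacyBucketWriter" \
    gs://my-bucket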

Create a New Authentication

  1. Go to Integrations Hub > Catalog.

  2. Search and select Google BigQuery.

  3. In the dialog that opens, choose an authentication mode. Only JSON keyfile is supported.

  4. Paste the JSON content of your service account key into the "JSON keyfile" field.

  5. See the Google Cloud documentation to create a new service account key; one way to do this from the command line is sketched below.
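
For example, you can create a key with gcloud. This is a minimal sketch; the service account email is a placeholder.

# Placeholder service account email; the key is written to key.json.
$ gcloud iam service-accounts keys create key.json \
    --iam-account=td-bigquery@my-project.iam.gserviceaccount.com

The contents of key.json are what you paste into the "JSON keyfile" field in step 4.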


Create a New Transfer

After creating the connection, you are automatically taken to the Authentications tab. Look for the connection you created and select Source.

Fetch from

Configure the data source to import.

Enter the ID of your Google Cloud Platform project in "Project ID".

Import Types

Select a type of import, either loading a whole table (table loading) or loading a SQL result (query loading).

Table Loading

If you want to load a whole table, select "Table", then provide the "Dataset name" and "Table name" of the table that you want to import.


Query Loading

If you want to load a SQL result, select "Query statement", then enter a SQL query in the "SQL statement" field.

Before creating the transfer, confirm that your query is valid in the BigQuery Web UI.

https://cloud.google.com/bigquery/quickstart-web-ui


The default SQL dialect is Standard SQL. Check Use Legacy SQL if you want to use Legacy SQL.

By default, this connector uses cached results under specific conditions. Uncheck Use Cached Results if you want to disable caching.

Data Location

You must specify the location if your data is in a location other than the US or EU multi-region, for example, the asia-northeast1 region.

See the Google Cloud document for more details about the location.

Incremental Loading

Incremental loading loads only the records added since the last execution, by using increasing, unique column(s), such as an auto-increment ID column or a timestamp column for the created date.

To enable it, check Incremental Loading, then specify the column names in "Incremental Column Names."

Only numerical types (INTEGER and FLOAT) and the TIMESTAMP type are supported as incremental columns.


How This Works

This connector records the "last record", which is the latest record ordered by the incremental columns. In the next execution, it loads records by running a query built from the last record according to the following rules:

With table loading, all fields are selected with the WHERE clause.

SELECT
  *
FROM
  `${dataset}.${table}`
WHERE
  ${incremental_column} > ${value_of_last_record}

With query loading, the raw query is wrapped with the WHERE clause.

SELECT
  *
FROM
  (${query}) embulk_incremental_
WHERE
  ${incremental_column} > ${value_of_last_record}

If there are multiple incremental columns (c1, c2 and c3, for example), the WHERE clause is similar to the following statement:

WHERE
  (c1 > 1)
OR
  (c1 = 1 AND c2 > 2)
OR
  (c1 = 1 AND c2 = 2 AND c3 > 3)

Import Large Dataset

When you load a large dataset (more than 500 MB as a rough guideline), we recommend that you use the "Import Large Dataset" option. This option exports the data as GCS (Google Cloud Storage) objects and then loads the data in multiple tasks, so loading is faster.

To enable this option, check Import Large Dataset then specify "Temp dataset", "Temp table", "GCS bucket" and "GCS path prefix".


How This Works

  • When running a query (query loading or table loading with incremental loading), the query result is exported to a temporary BigQuery table "temp.temp_table".

  • Then the temp table is exported to "gs://my-bucket/data-connector/result-[12-digit number].jsonl.gz" as gzipped JSON Lines files. The number of files depends on the size of the result data.

  • With table loading without incremental loading, all the data in the source table is directly exported to GCS.

  • After completion, the temp table and GCS objects are deleted.

Data Location of Temp Resources

  • The temp table must be in the same location as the tables you're querying. See the Google Cloud temporary and permanent tables document for more details.

  • The GCS bucket must also be in the same location as the tables unless the dataset is in the US multi-region; you can export data from a US-based dataset to a Cloud Storage bucket in another region. See the Google Cloud export limitations document for more details. A quick way to check both locations from the command line is sketched below.
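
One way to verify the locations before configuring the transfer is with the bq and gsutil tools. This is a minimal sketch; "my_dataset", "temp", and "my-bucket" are placeholders.

# Show the locations of the source dataset, the temp dataset, and the GCS bucket.
$ bq show --format=prettyjson my_dataset | grep '"location"'
$ bq show --format=prettyjson temp | grep '"location"'
$ gsutil ls -L -b gs://my-bucket | grep -i 'location constraint'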

Preview

After configuring the data source, select Next to see a sample result from the source. If an error occurs, you receive an alert with the error message.


Transfer to

After you select Next from the preview, you are asked to select the database and table in Treasure Data where you want to transfer the data. If you want to create a new database, select Create new database and give your database a name. Do the same with Create new table.

Select whether to append records to an existing table or replace your existing table.

If you want to set a different partition key seed than the default key, you can select one from the "Partition key seed".


When

On the When tab, you can specify the import job as a one-time transfer, or you can schedule an automated recurring transfer. If you selected Once now, select Start Transfer. Or select Repeat and specify your schedule options, then select Schedule Transfer.

After your transfer runs, you can see the results of your transfer in the Databases tab.


Now you are ready to start analyzing your data.

Notes

Data Conversion

BigQuery's data types are automatically converted to corresponding Treasure Data types, as indicated in the following table. If you include unsupported types in the schema of the table or query result, you receive errors; a casting workaround is sketched after the table.

BigQuery                          Treasure Data
STRING                            string
BYTES                             Unsupported
INTEGER                           long
FLOAT                             double
NUMERIC                           Unsupported
BOOLEAN                           long (true is 1, false is 0)
TIMESTAMP                         string (yyyy-MM-dd HH:mm:ss.SSS)
DATE                              Unsupported
TIME                              Unsupported
DATETIME                          Unsupported
RECORD                            string (as JSON)
REPEATED (PRIMITIVE or RECORD)    string (as JSON)
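
If your source contains unsupported types, one workaround is to use query loading and cast those columns to supported types in the SQL statement. The following is a minimal sketch; the columns amount (NUMERIC), birth_date (DATE), and updated_at (DATETIME) are hypothetical, and casting NUMERIC to FLOAT64 may lose precision.

SELECT
  id,
  CAST(amount AS FLOAT64) AS amount,           -- NUMERIC  -> FLOAT, imported as double
  CAST(birth_date AS STRING) AS birth_date,    -- DATE     -> STRING, imported as string
  CAST(updated_at AS TIMESTAMP) AS updated_at  -- DATETIME -> TIMESTAMP, imported as string
FROM
  my_dataset.my_table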

Use Connections for Input and Output

You can use the same BigQuery connection for both Data Connector (input) and Result Output (output), but currently you cannot use connections authenticated by OAuth for output.

Quotas and Limits

The quotas and limits of BigQuery and Cloud Storage apply to your GCP project.

Use BigQuery Connector via CLI

If you prefer, you can use the connector via TD Toolbelt.

Set up the TD Toolbelt on the CLI.

Create Configuration File

Create a configuration YAML file, referred to as "config.yml" here.

Example (config.yml)

in:
  type: bigquery
  project_id: my-project
  auth_method: json_key
  json_keyfile:
    content: |
      {
        "type": "service_account",
        "project_id": "xxxxxx",
        ...
       }
  import_type: table
  dataset: my_dataset
  table: my_table
  incremental: true
  incremental_columns: [id]
  export_to_gcs: true
  temp_dataset: temp
  temp_table: temp_table
  gcs_bucket: my-bucket
  gcs_path_prefix: data-connector/result-
out:
  type: td

Authentication of GCP

JSON Key

Specify "auth_method: json_key" and put a JSON content of your service account key into "json_keyfile.content"

auth_method: json_key
json_keyfile:
  content: |
    {
      "type": "service_account",
      "project_id": "xxxxxx",
      ...
     } 

OAuth

If you want to use an account authorized by your OAuth 2 application, specify "auth_method: oauth2", "client_id", "client_secret", and "refresh_token". Note that OAuth is no longer supported for this connector (see Limitations).

auth_method: oauth2
client_id: 000000000000-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.apps.googleusercontent.com
client_secret: yyyyyyyyyyyyyyyyyyyyyyyy
refresh_token: zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz

Import Type

Table Loading

With table loading, specify "import_type: table", "dataset" and "table"

import_type: table
dataset: my_dataset
table: my_table

Query Loading

With query loading, specify "import_type: query" and "query"

import_type: query
query: |-
  SELECT
    id, first_name, last_name, created_at
  FROM
    my_dataset.my_table
  WHERE first_name = "Treasure"

You can optionally specify "query_option". "use_legacy_sql" is false by default and "use_query_cache" is true by default.

query: SELECT ...
query_option:
  use_legacy_sql: false
  use_query_cache: true 

Data Location

You can specify the location with the "location" option if necessary

location: asia-northeast1

Incremental Loading

To enable it, specify "incremental: true" and "incremental_columns"

incremental: true
incremental_columns: [id]

Import Large Dataset

To enable it, specify "export_to_gcs: true", then add "temp_dataset", "temp_table", "gcs_bucket" and "gcs_path_prefix"

export_to_gcs: true
temp_dataset: temp
temp_table: temp_table
gcs_bucket: my-bucket
gcs_path_prefix: data-connector/result-

(Optional) Preview

Run the td connector:preview command to validate your configuration file

$ td connector:preview config.yml
+---------+-------------------+------------------+-------------------------------+
| id:long | first_name:string | last_name:string | created_at:timestamp          |
+---------+-------------------+------------------+-------------------------------+
| 1       | "Treasure"        | "Data"           | "2018-05-21 12:00:00.111 UTC" |
+---------+-------------------+------------------+-------------------------------+
1 row in set
Update config.yml and use 'td connector:preview config.yml' to preview again.
Use 'td connector:issue config.yml' to run Server-side bulk load.

Create New Connector Session

Run td connector:create.

The following example creates a daily import session with the BigQuery connector.

$ td connector:create daily_bigquery_import \
    "10 0 * * *" td_sample_db td_sample_table config.yml
Name     : daily_bigquery_import
Cron     : 10 0 * * *
Timezone : UTC
Delay    : 0
Database : td_sample_db
Table    : td_sample_table
Config
---
in:
  ...

Data Partition Key

Connector sessions need at least one timestamp column in the result data to use as the data partition key, and the first timestamp column is chosen as the key by default. Use the "--time-column" option if you want to explicitly specify a column.

$ td connector:create --time-column created_at \
    daily_bigquery_import ...

If your result data doesn't have any timestamp column, add a "time" column by adding the add_time filter configuration as follows.

in:
  type: bigquery
  ...
filters:
- type: add_time
  from_value:
    mode: upload_time
  to_column:
    name: time
out:
  type: td



