# Google Drive Import Integration

You can import data files from Google Drive into Treasure Data. This data connector supports only CSV and TSV files, or gzip archives of CSV and TSV files.

## Prerequisites

- Basic knowledge of Treasure Data, including the [TD Toolbelt](https://toolbelt.treasuredata.com/)
- A Google Account

## Use the TD Console to Create Your Connection

### Create a New Connection

Go to Integrations Hub > Catalog, then search for and select Google Drive.

![](/assets/image-20191015-140041.55e4e61badff4516641d1772e5151daff874fa5ea424f79d3fe2098efaa8c3f7.7a2fdaca.png)

Select **Create**. You are creating an authenticated connection. The following dialog opens.

![](/assets/image-20191015-140051.adbe6f214e4339c29f1a9d9abbd41ac03cb017fb7d8cbba3a2315dd887944a38.7a2fdaca.png)

Access to Google Drive requires OAuth2 authentication. Select **Click here** to connect to your Google account. Log into your Google account in the popup window and grant access to the Treasure Data app.

![](/assets/image-20191015-140119.6afe6692f82df2f21e05953eaf4c969773d8d50b79259e94ee6ec81666ea41fc.7a2fdaca.png)

You are redirected back to TD Console. Repeat the "Create a new connection" step and choose your new OAuth connection. Select **Continue**.

![](/assets/image-20191015-140127.ea8d47d5a36ae3e169dc5c06e7066dddf974763c7d7a2bb5a4d8ce028fc66ba5.7a2fdaca.png)

![](/assets/image2022-6-28_13-11-37.a70583d15051a4fb27cb40f858b9065f737b4269c20695bf067aa67ad3b65adc.7a2fdaca.png)

![](/assets/image-20191015-140135.f1c6bb9446cf951ba0a10fd97a63196a8d377c7cc3ea0a0ffcc40dc6624bdcae.7a2fdaca.png)

Name your new Google Drive connection. Select **Done**.

## Transfer Your Data to Treasure Data

After creating the authenticated connection, you are automatically taken to the Authentications tab. Look for the connection you created and select **New Source**.

### Import from File

Import a CSV or TSV file, or a gzip archive of a CSV or TSV file. For **File types**, select File.
![](/assets/image-20191015-140146.6e670d3d1c80a2ec112b9a3d227cf7938cbfb9e0aad8c758dda948ebd767c97b.7a2fdaca.png)

Parameters:

- **ID**: the file ID from Google Drive. Only CSV and TSV files, or gzip archives of CSV and TSV files, are supported.
- **Modified After**: import the file only if it was modified after this timestamp.
- **Incremental Loading**: use when importing data on a schedule, to import only files that are new or modified since the last import.

### Import from Folder

Import a list of CSV or TSV files, or gzip archives of CSV or TSV files. The folder must contain files of the same type; for example, a folder that contains only CSV files. For **File types**, select Folder.

![](/assets/image-20191015-140159.3eeba24a34c2b070205b77472529ccb299eee5d75be38be2f58da88a09a47205.7a2fdaca.png)

Parameters:

- **ID**: the folder ID from Google Drive. Only CSV and TSV files, or gzip archives of CSV and TSV files, are supported.
- **Modified After**: import only files that were modified after this timestamp.
- **Filename pattern**: a regular expression matched against file names. If a file name doesn't match this pattern, the file is skipped. For example, with the pattern *.csv$*, any file whose name doesn't end in .csv is skipped.
- **Incremental Loading**: use when importing data on a schedule, to import only files inside the folder that are new or modified since the last import.

### Preview

You see a [preview](https://docs.treasuredata.com/smart/project-product-documentation/about-data-preview) of your data. To make changes, select **Advanced Settings**; otherwise select **Next**.

![](/assets/image-20191015-140209.9acb167c3082078c8cb3b843d94aa75b41e6c4334bad4c067952b904c8c671ee.7a2fdaca.png)

### Advanced Settings

![](/assets/image-20191015-140223.e038f81c890e85b582ce66d6dbd79ac398b6352195c331b0e571ca3eac031816.7a2fdaca.png)

You can specify the following parameters:

- Decoders: you can change the file decoder to match the file's encoding.
```
Gzip
Bzip2
Zip
OpenPGP Decryption
```

- Parser: update the configuration for the CSV/TSV parser, if needed.

```
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
```

- Maximum retry times: the maximum number of retries for each API call.

```
Type: number
Default: 7
```

- Initial retry interval milliseconds: the wait time before the first retry.

```
Type: number
Default: 1000
```

- Maximum retry interval milliseconds: the maximum time between retries.

```
Type: number
Default: 120000
```

### Choose the Target Database and Table

Choose an existing database and table, or create a new database and table.

![](/assets/image-20191209-223011.d49b378b87d9308f0d634574a82fbabe1edb6149d1eb2a8f3a3b02f0c7fdca7d.7a2fdaca.png)

Create a new database and give your database a name. Complete similar steps for **Create new table**.

Select whether to **append** records to an existing table or **replace** your existing table.

If you want to set a **partition key seed** other than the default key, you can specify one using the popup menu.

### Scheduling

In the **When** tab, you can specify a one-time transfer or schedule an automated recurring transfer.

Parameters:

- **Once now**: set a one-time job.
- **Repeat…**
  - **Schedule**: accepts *@hourly*, *@daily*, and *@monthly*, as well as custom *cron* expressions.
  - **Delay Transfer**: add a delay to the execution time.
  - **TimeZone**: supports extended timezone formats like 'Asia/Tokyo'.

![](/assets/image-20191015-140248.9daa74ba0dc9dfeb5cecdec5ca398bfd08969e7656780fe3e64e2233078e7730.7a2fdaca.png)

### Details

Name your transfer and select **Done** to start.
![](/assets/image-20191015-140257.6ab8e45a9e44ecd67b9d01b7d9b0b869fbb2c09d855c506f041c9d86e26f0b7e.7a2fdaca.png)

After your transfer has run, you can see the results in the **Databases** tab.

## Use the Command Line to Create Your Google Drive Connection

You can use the TD Toolbelt from the command line to configure your connection.

### Install the Treasure Data Toolbelt

Install the newest [TD Toolbelt](https://toolbelt.treasuredata.com/).

### Create a Configuration File (seed.yml)

The configuration file includes an `in:` section, where you specify what comes into the connector from Google Drive, and an `out:` section, where you specify what the connector writes to the database in Treasure Data. For more details on available out modes, see the **Appendix**.

The following example imports a file from Google Drive:

```yaml
in:
  type: google_drive
  client_id: xxxxxxxxxxx
  client_secret: xxxxxxxxxxx
  refresh_token: xxxxxxxxxxx
  target: file
  id: xxxxxxxxxxx(drive file id)
  last_modified_time: "YYYY-MM-DD HH:MM:SS"
  incremental: false
out:
  mode: append
```

The following example imports the files in a folder from Google Drive:

```yaml
in:
  type: google_drive
  client_id: xxxxxxxxxxx
  client_secret: xxxxxxxxxxx
  refresh_token: xxxxxxxxxxx
  target: folder
  id: xxxxxxxxxxx(drive folder id)
  last_modified_time: "YYYY-MM-DD HH:MM:SS"
  filename_pattern: xxxxxxxxxxx (Ex: abc.csv$)
  incremental: false
out:
  mode: append
```

### Run the Guess Command to Generate the Run Configuration File (load.yml)

```bash
td connector:guess seed.yml -o load.yml
```

### Preview the Data to be Imported (Optional)

You can preview the data to be imported using the command `td connector:preview`.

```bash
td connector:preview load.yml
```

### Execute the Load Job

Use `td connector:issue` to execute the job. You must specify the database and table where you want to store the data before you execute the load job.
For example, `td_sample_db` and `td_sample_table`:

```bash
td connector:issue load.yml \
  --database td_sample_db \
  --table td_sample_table \
  --time-column date_time_column
```

It is recommended to specify the `--time-column` option, because Treasure Data's storage is partitioned by time. If the option is not given, the data connector selects the first long or timestamp column as the partitioning time. The column specified by `--time-column` must be of long or timestamp type (use the preview results to check the available column names and types; imports generally include a last_modified_date column).

If your data doesn't have a time column, you can add one by using the add_time filter option. See the [add_time filter](https://docs.treasuredata.com/smart/project-product-documentation/add_time-filter-function) plugin for details.

`td connector:issue` assumes you have already created the database (td_sample_db) and table (td_sample_table). If the database or table does not exist in TD, `td connector:issue` fails, so create the database and table manually or use `--auto-create-table` with `td connector:issue` to create them automatically.

```bash
$ td connector:issue load.yml \
  --database td_sample_db \
  --table td_sample_table \
  --time-column date_time_column \
  --auto-create-table
```

From the command line, submit the load job. Processing might take a couple of hours depending on the data size.

### Scheduled Execution

You can schedule periodic data connector execution for recurring file or folder imports. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon on your local data center.
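For scheduled imports, the seed configuration shown earlier is typically switched to incremental mode. A minimal sketch, reusing the placeholder values from the examples above:

```yaml
in:
  type: google_drive
  client_id: xxxxxxxxxxx
  client_secret: xxxxxxxxxxx
  refresh_token: xxxxxxxxxxx
  target: folder
  id: xxxxxxxxxxx(drive folder id)
  # With incremental: true, each scheduled run imports only files
  # modified since the previous run (see the Appendix).
  incremental: true
out:
  mode: append
```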
Scheduled execution supports configuration parameters that control the behavior of the data connector during its periodic attempts to fetch data from Google Drive:

- `incremental`: controls the load mode, which governs how the data connector fetches data from Google Drive based on the last modified time of the file, or of the files in the folder.
- `last_modified_time`: records the last modified time of the files from the previous load job. See "How Incremental Loading Works" in the Appendix for details and examples.

### Create the Schedule

Create a new schedule using the `td connector:create` command. The name of the schedule, a cron-style schedule, the database and table where the data will be stored, and the data connector configuration file are required.

The `cron` parameter accepts these options: `@hourly`, `@daily`, and `@monthly`. By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the `-t` or `--timezone` option. The `--timezone` option only supports extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.

```
$ td connector:create \
  daily_import \
  "10 0 * * *" \
  td_sample_db \
  td_sample_table \
  load.yml
```

It is also recommended to specify the `--time-column` option, since Treasure Data's storage is partitioned by time:

```
$ td connector:create \
  daily_import \
  "10 0 * * *" \
  td_sample_db \
  td_sample_table \
  load.yml \
  --time-column created_at
```

### List the Schedules

You can see the list of scheduled entries with the command `td connector:list`.
```
$ td connector:list
+--------------+------------+----------+-------+--------------+-----------------+--------------------------------------------+
| Name         | Cron       | Timezone | Delay | Database     | Table           | Config                                     |
+--------------+------------+----------+-------+--------------+-----------------+--------------------------------------------+
| daily_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"in"=>{"type"=>"google_drive",            |
+--------------+------------+----------+-------+--------------+-----------------+--------------------------------------------+
```

## Appendix

### Modes for the Out Plugin

You can specify the file import mode in the `out:` section of the load.yml file. The `out:` section controls how data is imported into a Treasure Data table; for example, you may choose to append data to an existing table or replace the data in it. Output modes are ways to modify the data as it is placed in Treasure Data.

- **Append** (default): records are appended to the target table.
- **Replace** (available in td 0.11.10 and later): replaces the data in the target table. Any manual schema changes made to the target table remain intact.

Examples:

```
in:
  ...
out:
  mode: append
```

```
in:
  ...
out:
  mode: replace
```

### How Incremental Loading Works

Incremental loading uses the last modified time to load only files whose modified time is later than the saved last modified time.

**Import from File with Incremental:**

At the first execution, the connector loads the file. If **incremental is true**, the file's modified time is saved as the latest modified time.
Example:

- Import file File0001.csv:

```
+--------------+--------------------------+
| Filename     | Modified time            |
+--------------+--------------------------+
| File0001.csv | 2019-05-04T10:00:00.123Z |
+--------------+--------------------------+
```

After the job finishes, last_modified_time is set to 2019-05-04T10:00:00.123Z.

At the next execution, File0001.csv is imported only if its modified time is greater than **2019-05-04T10:00:00.123Z**, meaning that the file was updated after the first execution.

**Import from Folder with Incremental:**

At the first execution, the connector loads all files of type CSV/TSV, or gzip of CSV/TSV. If **incremental is true**, the connector goes through the list of files in the folder and takes the maximum modified time of the files.

Example:

- The import folder contains these files:

```
+--------------+--------------------------+
| Filename     | Last update              |
+--------------+--------------------------+
| File0001.csv | 2019-05-04T10:00:00.123Z |
| File0011.csv | 2019-05-05T10:00:00.123Z |
| File0012.csv | 2019-05-06T10:00:00.123Z |
| File0013.csv | 2019-05-07T10:00:00.123Z |
| File0014.csv | 2019-05-08T10:00:00.123Z |
+--------------+--------------------------+
```

- **Max Modified Time**: 2019-05-08T10:00:00.123Z

After the job finishes, last_modified_time is set to the Max Modified Time (2019-05-08T10:00:00.123Z). At the next execution, only files with a modified time greater than **2019-05-08T10:00:00.123Z** are imported.

Example:

- The import folder now has newly updated and added files:

```
+--------------+--------------------------+
| Filename     | Last update              |
+--------------+--------------------------+
| File0001.csv | 2019-05-04T10:00:00.123Z |
| File0011.csv | 2019-05-05T10:00:00.123Z |
| File0012.csv | 2019-05-06T10:00:00.123Z |
| File0013.csv | 2019-05-09T10:00:00.123Z |
| File0014.csv | 2019-05-08T10:00:00.123Z |
| File0015.csv | 2019-05-09T13:00:00.123Z |
+--------------+--------------------------+
```

- **Max Modified Time**: 2019-05-09T13:00:00.123Z

Then only **File0013.csv** and **File0015.csv** are imported.
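The folder selection described above can be modeled in a few lines. This is an illustrative sketch of the watermark logic, not the connector's actual code; `select_files` is a hypothetical helper:

```python
def select_files(files, last_modified_time):
    """Return files modified after last_modified_time, plus the new watermark.

    `files` is a list of (name, modified_time) tuples; timestamps are
    ISO-8601 strings as in the tables above, so they compare correctly
    as plain strings.
    """
    # Only files strictly newer than the saved watermark are imported.
    picked = [(n, t) for n, t in files if t > last_modified_time]
    # The next watermark is the maximum modified time seen in the folder.
    new_watermark = max((t for _, t in files), default=last_modified_time)
    return picked, new_watermark

# The second folder example above, with last_modified_time from the first run:
files = [
    ("File0001.csv", "2019-05-04T10:00:00.123Z"),
    ("File0011.csv", "2019-05-05T10:00:00.123Z"),
    ("File0012.csv", "2019-05-06T10:00:00.123Z"),
    ("File0013.csv", "2019-05-09T10:00:00.123Z"),
    ("File0014.csv", "2019-05-08T10:00:00.123Z"),
    ("File0015.csv", "2019-05-09T13:00:00.123Z"),
]
picked, watermark = select_files(files, "2019-05-08T10:00:00.123Z")
print([n for n, _ in picked])  # ['File0013.csv', 'File0015.csv']
print(watermark)               # 2019-05-09T13:00:00.123Z
```

Note that File0014.csv is skipped: its modified time equals the watermark, and only strictly newer files are picked up.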
After the job finishes, last_modified_time is set to the Max Modified Time (2019-05-09T13:00:00.123Z).

### Obtain Your client_id, client_secret, and refresh_token

To issue jobs from the CLI, you need a **client_id**, **client_secret**, and **refresh_token**.

Go to [https://console.cloud.google.com/projectcreate](https://console.cloud.google.com/projectcreate) to register a project.

![](/assets/image-20191015-140319.a2cd8ccea7a02ee36fdd4f1125f59ac6e6052eac04c083108059c8952cba84a1.7a2fdaca.png)

Name your project, then select **Create**.

Go to [https://console.cloud.google.com/apis/dashboard](https://console.cloud.google.com/apis/dashboard) to enable the Drive service for your project.

![](/assets/image-20191015-140332.08cd8e0d8a2feeea780d1561c37ce50feb396a5eaa711cbb5fcd2c8073397b71.7a2fdaca.png)

![](/assets/image-20191015-140342.0c4d171dd4f6f107e7ed5d62070baf5d9ca038e681cc1d0e31a605648fd8fcdf.7a2fdaca.png)

Choose Google Drive API and select **Enable**.

![](/assets/image-20191015-140403.e538b3edf99594f269d6b8c6d488bcb6363e1757282d3852e987f77cd6a4fa06.7a2fdaca.png)

Go to [https://console.cloud.google.com/apis/credentials](https://console.cloud.google.com/apis/credentials), select your project, and create a new credential.

![](/assets/image-20191015-140437.ac1658e38b75d32ee99868f9a62c4ff971d77eae9b57570628824929fe73c421.7a2fdaca.png)

Go to the **OAuth consent screen** tab, provide the application name, then select **Add scope** to add the /auth/drive scope, and **Save**.

Go to the **Credentials** tab, click **Create credentials**, then select OAuth client ID.

![](/assets/image-20191015-140450.19b272bf31d9fc78b0bdcc013f08bd7003b9a058c76dd9d5507a8cfb55a8fde7.7a2fdaca.png)

Name your credential and add [https://developers.google.com/oauthplayground/](https://developers.google.com/oauthplayground/) as a redirect URL.

![](/assets/image-20191015-140509.5288c1ad0c500ab41f699bd95d4642741bd5ec85199e16dd82ac09f05bbb8c38.7a2fdaca.png)

Choose your credential to get your client_id and client_secret.
![](/assets/image-20191015-140924.40f14099cbb47d819667a06a87d88f9e61bb6ca223fea39722179f209bd01c13.7a2fdaca.png)

![](/assets/image-20191015-140939.8a84dd05858f1c97bfd5e4ba3d4ab36c73070fce0baff6abe8bc8e17993ce7ee.7a2fdaca.png)

Go to [https://developers.google.com/oauthplayground](https://developers.google.com/oauthplayground) to get a **refresh_token**. First, select Drive API v3 and the **auth/drive** scope. Then select Settings, provide the **client id** and **client secret** from your credential, and select **Authorize APIs**.

![](/assets/image-20191015-140951.1fc16c5c2e4bcbc45a86290c3d13f253415df987ae3f275dbe65a873e19f0f95.7a2fdaca.png)

Log in with your Google account and select **Allow**.

![](/assets/image-20191015-141008.3724665d859d8f8f8ea05db9e0f34cfa22508ea3ecaa86676e3fdfcb110bdf55.7a2fdaca.png)

Select **Exchange authorization code for tokens**.

![](/assets/image-20191015-141019.36bcc3a6337d266184e827bb391627ba2fc3e948d241d5600b5a7af3ec18d68c.7a2fdaca.png)

![](/assets/image-20191015-141036.ef22333fbdcfb8c29e49b47185356030067e9e90a658f0c46a466fb0582ac201.7a2fdaca.png)

Copy the refresh token and store it to use later.

### A Note about Token Expiration

A refresh token might stop working for one of these reasons:

- The user has revoked your app's access.
- The refresh token has not been used for six months.
- The user changed passwords and the refresh token contains Gmail scopes.
- The user account has exceeded the maximum number of granted (live) refresh tokens. There is currently a limit of 50 refresh tokens per user account per client. If the limit is reached, creating a new refresh token automatically invalidates the oldest refresh token without warning.

Reference: [https://developers.google.com/identity/protocols/OAuth2](https://developers.google.com/identity/protocols/OAuth2)

## Obtain a File ID or Folder ID from Google Drive

Go to [https://drive.google.com](https://drive.google.com/) and log in with your Google account.
**Get a Folder ID:**

Select the folder whose ID you want and copy the ID from the URL.

![](/assets/image-20191015-141047.e178d70f0c385526a7268f0fa06e35d0e754917e3996e60949f175fd5df6267c.7a2fdaca.png)

**Get a File ID:**

From your drive, right-click the file and select "**Get shareable link**".

![](/assets/image-20191015-141058.45cefa2678a35432ae686d7083a2cd047c48b794be415ee4b2dc03934c7fe65f.7a2fdaca.png)

Turn on **Link sharing** and copy the File ID from the link.

![](/assets/image-20191015-141118.e66e152c01509dff6c5aae223271d7b6ffd7c35cdddc11f9c6cf22965b4d4ea2.7a2fdaca.png)
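If you script your imports, the ID can also be pulled out of a Drive link programmatically. A small sketch, assuming the common current link shapes (`/file/d/<id>/...`, `/drive/folders/<id>`, and `open?id=<id>`), which Google may change:

```python
import re

def extract_drive_id(link):
    """Pull the file or folder ID out of a Google Drive link.

    Illustrative helper (not part of the connector). Returns None if
    the link does not match any of the assumed shapes.
    """
    # Path-style links: .../file/d/<id>/view or .../drive/folders/<id>
    m = re.search(r"/(?:file/d|folders)/([A-Za-z0-9_-]+)", link)
    if m:
        return m.group(1)
    # Query-style links: .../open?id=<id>
    m = re.search(r"[?&]id=([A-Za-z0-9_-]+)", link)
    return m.group(1) if m else None

print(extract_drive_id("https://drive.google.com/file/d/1AbC_dEf-123/view?usp=sharing"))
# 1AbC_dEf-123
```

The extracted value is what goes in the **ID** field of the TD Console form, or in `id:` in seed.yml.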