The import data connector for Amazon S3 enables you to import the data from your JSON, TSV, and CSV files stored in an S3 bucket.

Differences between Amazon S3 Import Integration v2 and Amazon S3 Import Integration v1

Review the information in the following table to understand the differences and potential advantages between v2 and v1. For v1 details, see Amazon S3 Import Integration v1.

Authentication Method | Amazon S3 v2 | Amazon S3 v1
basic                 | x            | x
anonymous             |              | x
session               | x            | x
assume_role           | x            |

With assume_role as the authentication method, you cannot import Amazon S3 data via workflow.

Prerequisites

You must have basic knowledge of Treasure Data.

You must set up an access route in AWS if you are using an AWS S3 bucket located in the same region as your TD region. You set up the access route by specifying the VPC. For example, if in the US region, configure access through vpc-df7066ba. If in the Tokyo region, configure access through vpc-e630c182 and, for the EU01 region, vpc-f54e6a9e.
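
As a rough illustration of the region-to-VPC mapping above, the following Python sketch looks up the VPC ID for a TD region and builds a policy condition block around it. The use of the `aws:SourceVpc` condition key is an assumption about how the access route might be expressed in a bucket policy; the function name `source_vpc_condition` is hypothetical, and the rest of the policy is deliberately left out.

```python
# VPC IDs per TD region, copied from the paragraph above.
TD_VPC_BY_REGION = {
    "US": "vpc-df7066ba",
    "Tokyo": "vpc-e630c182",
    "EU01": "vpc-f54e6a9e",
}


def source_vpc_condition(region: str) -> dict:
    """Return a policy Condition block limited to the TD VPC of a region.

    Assumption: the access route is expressed with the aws:SourceVpc
    condition key. Principal, actions, and resources are not covered here.
    """
    return {"StringEquals": {"aws:SourceVpc": TD_VPC_BY_REGION[region]}}


print(source_vpc_condition("US"))
```

An unknown region raises `KeyError`, which is intentional in this sketch: it is safer to fail than to silently produce a condition with the wrong VPC.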

To determine the region of your TD Console, check the URL that you use to log in to TD, then refer to the data connector for that region.

Static IP Address of Treasure Data

The static IP address of Treasure Data is the access point and source of the linkage for this Integration. To determine the static IP address, contact your Customer Success representative or Technical support.

Import from AWS S3 via TD Console

Use the TD Console to Create Your Connection

You can use TD Console to create your data connector.

Create a New Connection

When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

  1. Navigate to Integrations Hub > Catalog and search for AWS S3.

  2. Select Create Authentication.

  3. A new Authentication dialog opens.

  4. Set the following parameters. Select Continue. Name your new AWS S3 connection. Select Done.



Endpoint

Authentication Method


basic

  • Uses access_key_id and secret_access_key to authenticate. See AWS Programmatic access.

    • Access Key ID

    • Secret access key

session

  • Uses a temporarily generated access_key_id, secret_access_key, and session_token.

    • Access Key ID

    • Secret access key

    • Secret token

assume_role

  • Uses role access. See AWS AssumeRole.
    • TD's Instance Profile

    • Account ID

    • Your Role Name

    • External ID
    • Duration In Seconds

Parameter             | Description
Access Key ID         | AWS S3 issued
Secret Access Key     | AWS S3 issued
Secret token          |
TD's Instance Profile |
Account ID            | Your AWS Account ID
Your Role Name        | Your AWS Role Name
External ID           | Your Secret External ID
Duration In Seconds   | Duration For The Temporary Credentials

Create an Authentication with the assume_role authentication method

1. Create a new authentication with the assume_role authentication method

2. Create your AWS IAM role.


Transfer Your AWS S3 Data to Treasure Data

After creating the authenticated connection, you are automatically taken to Authentications.

  1. Search for the connection you created. 

  2. Select New Source.


Connection

  1. Type a name for your Source in the Data Transfer field.

  2. Click Next.

Source Table

The Source dialog opens.

  1. Edit the following parameters.



Parameters

Description

Bucket

  • Provide the S3 bucket name (Ex. your_bucket_name)

Path Prefix

  • Specifies a prefix for target keys. (Ex. logs/data_)

Path Regex

  • Use a regular expression to match file paths. If a file path doesn’t match the specified pattern, the file is skipped. For example, if you specify the pattern \.csv$, a file is skipped unless its path ends in .csv. Read more about regular expressions.

Skip Glacier Objects

  • Select to skip processing objects stored in the Amazon Glacier storage class. If objects are stored in the Glacier storage class, but this option is not checked, an exception is thrown.

Filter by Modified Time

  • Choose how to filter files for ingestion:

Unchecked (default):

  • Start after path: inserts last_path parameter so that the first execution skips files before the path. (Ex. logs/data_20170101.csv)

  • Incremental: enables incremental loading. If incremental loading is enabled, config diff for the next execution includes the last_path parameter so that the next execution skips files before the path. Otherwise, last_path is not included.

Checked:

  • Modified after: inserts last_modified_time parameters so that the first execution skips files that were modified before that specified timestamp (Ex. 2019-06-03T10:30:19.806Z)

  • Incremental by Modified Time: enables incremental loading by modified time. If incremental loading is enabled, config diff for the next execution includes the last_modified_time parameter so that the next execution skips files that were modified before that time. Otherwise, last_modified_time is not included.

You can limit access to your S3 bucket/IAM user by using a list of static IPs. Contact support@treasuredata.com if you need static IPs.

There are instances where you might need to scan all the files in a directory (such as from the top-level directory "/"). In such instances, you must use the CLI to do the import.

Example

Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, CloudFront saves the log files to your bucket, as follows:

[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]

In this case, the Source Table settings are as shown:

  • Bucket: your_bucket

  • Path Prefix: logging/

  • Path Regex: \.gz$ (optional)

  • Start after path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz (Assuming that you want to import the log files from 2017-04-23-16.)

  • Incremental: true (if you want to schedule this job.)
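
The effect of Start after path on the four log files above can be sketched in Python as follows. This is a simplified model that assumes keys are compared as plain strings in ascending order; the helper name `select_after` is hypothetical and is not part of the connector.

```python
# The four CloudFront log files from the example, written as S3 keys.
files = [
    "logging/E231A697YXWD39.2017-04-23-15.a103fd5a.gz",
    "logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz",
    "logging/E231A697YXWD39.2017-04-23-16.594fa8e6.gz",
    "logging/E231A697YXWD39.2017-04-23-16.d12f42f9.gz",
]


def select_after(keys, last_path):
    """Return the keys that sort strictly after last_path, in ascending order."""
    return [key for key in sorted(keys) if key > last_path]


selected = select_after(files, "logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz")

# With Incremental enabled, the last imported key becomes the next last_path.
next_last_path = selected[-1] if selected else None

for key in selected:
    print(key)
```

Only the two files from 2017-04-23-16 are selected, and the last of them is what an incremental run would carry forward as last_path in this model.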

The BZip2 decoder plugin is supported by default. See File Decoder Function.

Data Settings

  1. Select Next.
    The Data Settings page opens.

  2. Optionally, edit the data settings or skip this page of the dialog.

Filters


Import Integration Filters enable you to modify your imported data after you have completed Editing Data Settings for your import.

To apply import integration filters:

Select Next in Data Settings.

The Filters dialog opens.

Select the filter option you want to add.


Select Add Filter.

The parameter dialog for that filter opens.

Edit the parameters.

For information on each filter type, see one of the following:
Retaining Columns Filter
Adding Columns Filter
Dropping Columns Filter
Expanding JSON Filter
Digesting Filter

Optionally, to add another filter of the same type, select Add within the specific column filter dialog.
Optionally, to add another filter of a different type, select the filter option from the list and repeat the same steps.
After you have added the filters you want, select Next.
The Data Preview dialog opens.

Data Preview


You can see a preview of your data before running the import by selecting Generate Preview.

Data shown in the data preview is approximated from your source. It is not the actual data that is imported.

  1. Click Next.
    Data preview is optional and you can safely skip to the next page of the dialog if you want.

  2. To preview your data, select Generate Preview. Optionally, click Next.

  3. Verify that the data looks approximately like you expect it to.


  4. Select Next.

Data Placement

For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

  1. Select Next. Under Storage, create a new database or select an existing one, and create a new table or select an existing one, to hold the imported data.

  2. Select a Database > Select an existing or Create New Database.

  3. Optionally, type a database name.

  4. Select a Table > Select an existing or Create New Table.

  5. Optionally, type a table name.

  6. Choose the method for importing the data.

    • Append (default): Data import results are appended to the table.
      If the table does not exist, it will be created.

    • Always Replace: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.

    • Replace on New Data: Replaces the entire content of an existing table with the result output only when there is new data.

  7. Select the Timestamp-based Partition Key column.
    If you want to set a different partition key seed than the default key, you can specify the long or timestamp column as the partitioning time. As a default time column, it uses upload_time with the add_time filter.

  8. Select the Timezone for your data storage.

  9. Under Schedule, you can choose when and how often you want to run this query.

    • Run once:
      1. Select Off.

      2. Select Scheduling Timezone.

      3. Select Create & Run Now.

    • Repeat the query:

      1. Select On.

      2. Select the Schedule. The UI provides these four options: @hourly, @daily and @monthly or custom cron.

      3. You can also select Delay Transfer and add a delay of execution time.

      4. Select Scheduling Timezone.

      5. Select Create & Run Now.

 After your transfer has run, you can see the results of your transfer in Data Workbench > Databases.

Validating Your Data Connector Jobs

How do I troubleshoot data import problems?

Review the job log. Warnings and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.

To find out more about a specific job, you can select that job and see details. Depending on the type of job, you can see some or all of the following: results, query, output logs, engine logs, details, and destination.

  1. Open the TD Console.

  2. Navigate to Jobs. The number of jobs is listed in the upper right of the page.


  3. Optionally, use filters to narrow the list of jobs to the ones you are interested in. You can filter by job owner, date, and database name.

  4. Select a job to open it and view results, query definition, logs, and other details.


  5. Each tab has different information about the job.



Results

  • View the imported data from the job.

  • From here you can copy the results to the clipboard or download them as a CSV file.

Query

  • View the query syntax of the job

  • Launch a query editor

  • Copy queries and use to create new queries or workflows

  • Refine queries to improve efficiency

Output and Engine Logs

  • Log information can be reviewed for run times, query result numbers, and error codes

  • Log information can be copied to the clipboard

Details

View further details:

  • query name

  • type

  • job id

  • status

  • duration

  • scheduled and actual times

  • result count and size

  • runner

  • database queried

  • priority

Destination

Here you can view details of an export integration configuration:

  • integration

  • type

  • settings

What can I do if the data connector for the S3 job is running for a long time?

Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, the performance degrades. To mitigate this issue, you can:

  • Narrow path_prefix option and reduce the count of S3 files.

  • Set the min_task_size option to 268435456 (256 MB).
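
To give a feel for why a larger min_task_size helps with many small files, here is a minimal Python sketch. It assumes that the connector bundles input files into one task until the task reaches min_task_size bytes; that bundling behavior, and the function name `group_into_tasks`, are assumptions made for illustration rather than a description of the connector's internals.

```python
MIB = 1024 * 1024
MIN_TASK_SIZE = 256 * MIB  # 268435456 bytes, the value suggested above


def group_into_tasks(file_sizes, min_task_size):
    """Bundle consecutive files into tasks of at least min_task_size bytes.

    Assumption: a task keeps accepting files until its total size reaches
    min_task_size; the final task may be smaller.
    """
    tasks, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= min_task_size:
            tasks.append(current)
            current, current_size = [], 0
    if current:
        tasks.append(current)
    return tasks


# 10,000 files of 1 MiB each: one task per file without bundling.
sizes = [MIB] * 10_000
tasks = group_into_tasks(sizes, MIN_TASK_SIZE)
print(MIN_TASK_SIZE, len(tasks))  # 268435456 40
```

Under this model, 10,000 one-megabyte files collapse into 40 tasks instead of 10,000, which is the kind of reduction the mitigation aims for.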

Import from AWS S3 via Workflow

The key difference and benefit of Amazon S3 Import Integration v2 over v1 is the added support for assume_role. However, with assume_role as the authentication method, you cannot import Amazon S3 data via workflow.

Import from AWS S3 via CLI (Toolbelt)

Optionally, you can use the TD Toolbelt to configure the connection, create the job, and schedule executions.


You cannot use the connector:guess, connector:preview, connector:issue, connector:create, and connector:update commands if the authentication method is assume_role.

Use the CLI to Configure the Connector

Before setting up the connector, install the td command by installing the most current TD Toolbelt.

Create Seed Config File (seed.yml)

Prepare the seed.yml as shown in the following example with your AWS access keys. You must also specify the bucket name and source file name (or prefix for multiple files).

in:
  type: s3_v2
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  # path to the *.json or *.csv or *.tsv file on your s3 bucket
  path_prefix: path/to/sample_file
  path_match_pattern: \.csv$ # a file will be skipped if its path doesn't match with this pattern

  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$ # match files whose suffix is .csv or .csv.gz
out:
  mode: append

The Data Connector for Amazon S3 imports all files that match the specified prefix. (e.g. path_prefix: path/to/sample_ –> path/to/sample_201501.csv.gz, path/to/sample_201502.csv.gz, …, path/to/sample_201505.csv.gz).

Using path_prefix with a leading '/' can lead to unintended results. For example, "path_prefix: /path/to/sample_file" causes the plugin to look for the file in s3://sample_bucket//path/to/sample_file, which on S3 is different from the intended path of s3://sample_bucket/path/to/sample_file.
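
The interaction between path_prefix and path_match_pattern can be approximated with a few lines of Python. This sketch assumes that the prefix is a literal string match on the start of the key and that the pattern is applied as a regular-expression search anywhere in the key; the helper `select_keys` and the sample keys are hypothetical.

```python
import re


def select_keys(keys, path_prefix, path_match_pattern=None):
    """Keep keys that start with path_prefix and, if given, match the pattern."""
    pattern = re.compile(path_match_pattern) if path_match_pattern else None
    return [
        key
        for key in keys
        if key.startswith(path_prefix)
        and (pattern is None or pattern.search(key))
    ]


keys = [
    "path/to/sample_201501.csv.gz",
    "path/to/sample_201502.csv.gz",
    "path/to/sample_201503.csv",
    "path/to/other_201501.csv",
]

# Prefix only: every key that starts with path/to/sample_ is selected.
print(select_keys(keys, "path/to/sample_"))

# Prefix plus pattern: only keys that end in .csv remain.
print(select_keys(keys, "path/to/sample_", r"\.csv$"))

# A leading '/' matches nothing, because S3 keys do not start with '/'.
print(select_keys(keys, "/path/to/sample_"))
```

The last call returns an empty list, which mirrors the leading-slash problem described above.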

Guess Fields (Generate load.yml)

Use connector:guess. This command automatically reads the source files and assesses (uses logic to guess) the file format and its field/columns.

$ td connector:guess seed.yml -o load.yml

If you open up load.yml, you’ll see the assessed file format definitions, including file formats, encodings, column names, and types.

in:
  type: s3_v2
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  path_prefix: path/to/sample_file
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    skip_header_lines: 1
    columns:
    - name: id
      type: long
    - name: company
      type: string
    - name: customer
      type: string
    - name: created_at
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
out:
  mode: append

Then, you can see a preview of the data using the td connector:preview command.

$ td connector:preview load.yml
+-------+---------+----------+---------------------+
| id    | company | customer | created_at          |
+-------+---------+----------+---------------------+
| 11200 | AA Inc. |    David | 2015-03-31 06:12:37 |
| 20313 | BB Inc. |      Tom | 2015-04-01 01:00:07 |
| 32132 | CC Inc. | Fernando | 2015-04-01 10:33:41 |
| 40133 | DD Inc. |    Cesar | 2015-04-02 05:12:32 |
| 93133 | EE Inc. |     Jake | 2015-04-02 14:11:13 |
+-------+---------+----------+---------------------+

The guess command needs more than three rows and two columns in the source data file because the command assesses the column definition using sample rows from source data.

If the system detects a column name or column type incorrectly, edit load.yml directly and preview again.

Currently, the Data Connector supports parsing of the “boolean”, “long”, “double”, “string”, and “timestamp” types.

Execute Load Job

Submit the load job. It may take a couple of hours, depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.

It is also recommended to specify the --time-column option because Treasure Data’s storage is partitioned by time (see data partitioning). If the option is not provided, the data connector chooses the first long or timestamp column as the partitioning time. The column specified by --time-column must be of either long or timestamp type.

If your data doesn’t have a time column, you can add one by using the add_time filter option. For more details, see add_time filter plugin.

$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
  --time-column created_at

The connector:issue command assumes that you have already created a database (td_sample_db) and a table (td_sample_table). If the database or the table does not exist in TD, this command will not succeed, so create the database and table manually or use the --auto-create-table option with the td connector:issue command to auto-create the database and table:

$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table

The data connector does not sort records on the server-side. To use time-based partitioning effectively, sort records in files beforehand.

If you have a field called time, you don’t have to specify the --time-column option.

$ td connector:issue load.yml --database td_sample_db --table td_sample_table
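
The rules for picking the partitioning time column can be summarized in a small Python sketch. The order of precedence used here (an explicit --time-column, then a column named time, then the first long or timestamp column) is an assumption pieced together from the statements above, and `choose_time_column` is a hypothetical helper.

```python
def choose_time_column(columns, time_column=None):
    """Return the name of the column used as the partitioning time.

    Assumed precedence: explicit option, then a column named "time",
    then the first column of type long or timestamp.
    """
    if time_column is not None:
        return time_column
    if any(column["name"] == "time" for column in columns):
        return "time"
    for column in columns:
        if column["type"] in ("long", "timestamp"):
            return column["name"]
    return None


# Columns as guessed in the load.yml example above.
columns = [
    {"name": "id", "type": "long"},
    {"name": "company", "type": "string"},
    {"name": "customer", "type": "string"},
    {"name": "created_at", "type": "timestamp"},
]

print(choose_time_column(columns))                # id
print(choose_time_column(columns, "created_at"))  # created_at
```

Without the option, the sample schema would fall back to id because it is the first long column, which is one reason to pass --time-column created_at explicitly.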

Import Modes

You can specify file import mode in the out section of the load.yml file.

The out: section controls how data is imported into a Treasure Data table.
For example, you may choose to append data or replace data in an existing table in Treasure Data.

Mode

Description

Examples

Append

Records are appended to the target table.

in:
  ...
out:
  mode: append

Always Replace

Replaces data in the target table. Any manual schema changes made to the target table remain intact.

in:
  ...
out:
  mode: replace

Replace on new data

Replaces data in the target table only when there is new data to import.

in:
  ...
out:
  mode: replace_on_new_data
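
The difference between the three modes can be modeled on an in-memory list. This Python sketch only illustrates the semantics described above and does not reflect how Treasure Data stores tables; `apply_import` is a hypothetical name.

```python
def apply_import(table, new_rows, mode):
    """Return the table contents after an import in the given mode."""
    if mode == "append":
        return table + new_rows
    if mode == "replace":
        return list(new_rows)
    if mode == "replace_on_new_data":
        # Keep the existing rows when the import brings no new data.
        return list(new_rows) if new_rows else table
    raise ValueError(f"unknown mode: {mode}")


existing = ["row1", "row2"]

print(apply_import(existing, ["row3"], "append"))         # ['row1', 'row2', 'row3']
print(apply_import(existing, [], "replace"))              # []
print(apply_import(existing, [], "replace_on_new_data"))  # ['row1', 'row2']
```

The last two calls show the practical difference: an empty import wipes the table under replace but leaves it untouched under replace_on_new_data.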

Scheduling Executions

You can schedule periodic data connector execution for incremental file import. We configure our scheduler carefully to ensure high availability.

A scheduled import ingests all files that match the specified prefix and that satisfy one of the following conditions:

  • If use_modified_time is disabled, the last path is saved for the next execution. On the second and subsequent runs, the connector only imports files that come after the last path in alphabetical order.

  • Otherwise, the time that the job is executed is saved for the next execution. On the second and subsequent runs, the connector only imports files that were modified after that execution time in alphabetical order.
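
The modified-time variant of incremental loading can be sketched as a small state machine in Python. The sketch assumes that each run imports objects modified after the saved time and then saves the run's own execution time, as described above; `run_once`, the object names, and the timestamps are hypothetical.

```python
from datetime import datetime, timezone


def run_once(objects, last_modified_time, now):
    """Simulate one scheduled run.

    objects maps each S3 key to its last-modified time. Returns the keys
    imported by this run (in alphabetical order) and the time to save for
    the next run.
    """
    imported = sorted(
        key for key, modified in objects.items() if modified > last_modified_time
    )
    return imported, now


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


objects = {
    "logs/a.csv": utc(2019, 6, 1, 0, 0, 0),
    "logs/b.csv": utc(2019, 6, 4, 0, 0, 0),
}

# First run: only files modified after the configured timestamp are imported.
first, saved = run_once(objects, utc(2019, 6, 3, 10, 30, 19), now=utc(2019, 6, 5))
print(first)  # ['logs/b.csv']

# A new file arrives before the second run.
objects["logs/c.csv"] = utc(2019, 6, 6, 0, 0, 0)
second, saved = run_once(objects, saved, now=utc(2019, 6, 7))
print(second)  # ['logs/c.csv']
```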

Create a Schedule Using the TD Toolbelt

A new schedule can be created using the td connector:create command.

$ td connector:create daily_import "10 0 * * *" \
    td_sample_db td_sample_table load.yml

It’s also recommended to specify the --time-column option, because Treasure Data’s storage is partitioned by time (see also data partitioning).

$ td connector:create daily_import "10 0 * * *" \
    td_sample_db td_sample_table load.yml \
    --time-column created_at

The `cron` parameter also accepts three special options: `@hourly`, `@daily`, and `@monthly`.

By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The `--timezone` option supports only extended timezone formats such as 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations such as PST and CST are *not* supported and may lead to unexpected schedules.

List All Schedules

You can see the list of currently scheduled entries by running the command td connector:list.

$ td connector:list
+--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+
| Name         | Cron         | Timezone | Delay | Database     | Table           | Config                                   |
+--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+
| daily_import | 10 0 * * *   | UTC      | 0     | td_sample_db | td_sample_table | {"in"=>{"type"=>"s3", "access_key_id"... |
+--------------+--------------+----------+-------+--------------+-----------------+------------------------------------------+

Show Schedule Settings and History

td connector:show shows the execution setting of a schedule entry.

The placeholders in the following example have these meanings:

<access_key_id>

AWS access key ID used to authenticate to AWS S3

<secret_access_key>

AWS secret access key paired with the access key ID

<endpoint>

The S3 service endpoint that the connector connects to

Example value: s3.amazonaws.com

<bucket>

Name of the S3 bucket that contains the source files

Example value: my-bucket

<path_prefix>

Specify a prefix for target keys

Example values:

logging/

path/to/sample_ (matches path/to/sample_201501.csv.gz, path/to/sample_201502.csv.gz, …, path/to/sample_201505.csv.gz)


$ td connector:show daily_import
Name     : daily_import
Cron     : 10 0 * * *
Timezone : UTC
Delay    : 0
Database : td_sample_db
Table    : td_sample_table
Config
---
in:
  type: s3
  access_key_id: <access_key_id>
  secret_access_key: <secret_access_key>
  endpoint: <endpoint>
  bucket: <bucket>
  path_prefix: <path_prefix>
  parser:
    charset: UTF-8
    ...

td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.

$ td connector:history daily_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-18 00:10:05 +0000 | 160      |
| 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-17 00:10:07 +0000 | 161      |
| 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-16 00:10:03 +0000 | 152      |
| 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-15 00:10:04 +0000 | 163      |
| 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-14 00:10:04 +0000 | 164      |
| 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-13 00:10:04 +0000 | 155      |
| 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-12 00:10:05 +0000 | 156      |
| 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-11 00:10:04 +0000 | 157      |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set

Delete Schedule

td connector:delete removes the schedule.

$ td connector:delete daily_import

IAM Permissions

The IAM credentials specified in the YML configuration file and used for the connector:guess and connector:issue commands must have permissions for the AWS S3 resources that they need to access. If the IAM user does not have these permissions, configure the user with one of the predefined Policy Definitions or create a new Policy Definition in JSON format.

The following example is based on the Policy Definition reference format and gives the IAM user read-only permission (through the GetObject and ListBucket actions) for your-bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}

Replace your-bucket with the actual name of your bucket.
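
If you prefer to generate the policy rather than edit it by hand, a short Python sketch such as the following produces the same JSON for any bucket name. The function name `read_only_policy` is hypothetical; the structure is copied from the policy shown above.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build the read-only policy shown above for the given bucket name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (ListBucket)
                    f"arn:aws:s3:::{bucket}/*",  # the objects in it (GetObject)
                ],
            }
        ],
    }


print(json.dumps(read_only_policy("sample_bucket"), indent=2))
```

Both resource entries are needed: ListBucket applies to the bucket ARN, while GetObject applies to the object ARNs under it.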

Use AWS Security Token Service (STS) as Temporary Credentials Provider

In certain cases, IAM basic authentication through access_key_id and secret_access_key might be too risky (although the secret_access_key is never shown in clear text when a job is executed or after a session is created).

The S3 data connector can use Temporary Security Credentials provided by AWS Security Token Service (STS). Using AWS STS, any IAM user can use their own access_key_id and secret_access_key to create a set of temporary new_access_key_id, new_secret_access_key, and session_token keys with an associated expiration time, after which the credentials become invalid.
The following are types of Temporary Security Credentials:

  • Session Token
    The simplest Security Credentials with an associated expiration time. The temporary credentials give access to all resources that the original IAM credentials used to generate them could access. These credentials are valid until they expire, provided that the permissions of the original IAM credentials don’t change.

  • Federation Token
    Adds an extra layer of permission control on top of the Session Token. When generating a Federation Token, the IAM user must specify a Permission Policy definition. The policy can further narrow down which of the resources accessible to the IAM user the bearer of the Federation Token can access. Any Permission Policy definition can be used, but the effective permissions are limited to all or a subset of the permissions held by the IAM user who generated the token. As with the Session Token, the Federation Token credentials are valid until they expire, provided that the permissions associated with the original IAM credentials don’t change.

AWS STS Temporary Security Credentials can be generated using the AWS CLI or the AWS SDK in the language of your choice.

Session Token

$ aws sts get-session-token --duration-seconds 900
{
    "Credentials": {
        "SecretAccessKey": "YYYYYYYYYY",
        "SessionToken": "ZZZZZZZZZZ",
        "Expiration": "2015-12-23T05:11:14Z",
        "AccessKeyId": "XXXXXXXXXX"
    }
}

Federation Token

$ aws sts get-federation-token --name temp_creds --duration-seconds 900 \
  --policy '{"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": ["arn:aws:s3:::bucketname", "arn:aws:s3:::bucketname/*"]}]}'
{
    "FederatedUser": {
        "FederatedUserId": "523683666290:temp_creds",
        "Arn": "arn:aws:sts::523683666290:federated-user/temp_creds"
    },
    "Credentials": {
        "SecretAccessKey": "YYYYYYYYYY",
        "SessionToken": "ZZZZZZZZZZ",
        "Expiration": "2015-12-23T06:06:17Z",
        "AccessKeyId": "XXXXXXXXXX"
    },
    "PackedPolicySize": 16
}

where:

  • temp_creds is the name of the federated token/user.

  • bucketname is the name of the bucket to give access to. Refer to the ARN specification for more details.

  • s3:GetObject and s3:ListBucket are the basic read operations for an AWS S3 bucket.

AWS STS credentials cannot be revoked. They will remain effective until expired, or until you delete or remove the permissions of the original IAM user used to generate the credentials.

When your Temporary Security Credentials are generated, copy the SecretAccessKey, AccessKeyId, and SessionToken into your seed.yml file as follows.

in:
  type: s3_v2
  auth_method: session
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  session_token: ZZZZZZZZZZ
  bucket: sample_bucket
  path_prefix: path/to/sample_file

Then execute the Data Connector for S3 as usual.
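
Copying the three values by hand is error-prone, so the step can be scripted. The following Python sketch takes the JSON printed by aws sts get-session-token and renders the seed.yml text shown above. The helper name `seed_from_sts` is hypothetical, and the values are written as double-quoted strings on the assumption that quoting keeps special characters in real tokens from being misread as YAML syntax.

```python
import json

# Sample output in the shape shown in the Session Token example above.
STS_OUTPUT = """
{
    "Credentials": {
        "SecretAccessKey": "YYYYYYYYYY",
        "SessionToken": "ZZZZZZZZZZ",
        "Expiration": "2015-12-23T05:11:14Z",
        "AccessKeyId": "XXXXXXXXXX"
    }
}
"""


def seed_from_sts(sts_json: str, bucket: str, path_prefix: str) -> str:
    """Render seed.yml text from the JSON output of get-session-token."""
    creds = json.loads(sts_json)["Credentials"]

    def quoted(value: str) -> str:
        # A JSON string literal is also a valid double-quoted YAML scalar.
        return json.dumps(value)

    lines = [
        "in:",
        "  type: s3_v2",
        "  auth_method: session",
        f"  access_key_id: {quoted(creds['AccessKeyId'])}",
        f"  secret_access_key: {quoted(creds['SecretAccessKey'])}",
        f"  session_token: {quoted(creds['SessionToken'])}",
        f"  bucket: {quoted(bucket)}",
        f"  path_prefix: {quoted(path_prefix)}",
    ]
    return "\n".join(lines) + "\n"


seed_text = seed_from_sts(STS_OUTPUT, "sample_bucket", "path/to/sample_file")
print(seed_text)
```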

Credential Expiration

Because STS credentials expire after the specified amount of time, the data connector job that uses the credential might eventually start failing when credential expiration occurs.
Currently, if the STS credentials are reported as expired, the data connector job retries up to the maximum number of times (5) and eventually completes with an 'error' status.
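
One way to avoid this failure is to check, before submitting a job, whether the credentials will outlive it. The Python sketch below compares the Expiration value returned by STS with an estimated job duration. The helper names are hypothetical, the duration estimate is something you would supply yourself, and the timestamp format is assumed to match the sample output above.

```python
from datetime import datetime, timedelta, timezone


def parse_expiration(value: str) -> datetime:
    """Parse an Expiration value such as 2015-12-23T05:11:14Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def outlives_job(expiration: str, now: datetime, expected_duration: timedelta) -> bool:
    """Return True if the credentials stay valid for the whole expected job."""
    return now + expected_duration <= parse_expiration(expiration)


# Credentials issued at 04:56:14 UTC with --duration-seconds 900.
issued_at = datetime(2015, 12, 23, 4, 56, 14, tzinfo=timezone.utc)
expiration = "2015-12-23T05:11:14Z"

print(outlives_job(expiration, issued_at, timedelta(minutes=10)))  # True
print(outlives_job(expiration, issued_at, timedelta(minutes=20)))  # False
```

If the check fails, requesting credentials with a longer duration before running the job is usually simpler than diagnosing a job that ends in 'error' after its retries.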

