Learn more about Amazon S3 Export Integration.

The data connector for Amazon S3 enables you to import the data from your JSON, TSV, and CSV files stored in an S3 bucket.

For sample workflows on importing data from files stored in an S3 bucket, go to the Treasure Box on GitHub.

An update to provide support for AssumeRole is coming in Spring 2022.


Prerequisites

You must have basic knowledge of Treasure Data.

You must set up an access route in AWS if you are using an AWS S3 bucket located in the same region as your TD region. You set up the access route by specifying the VPC. For example, in the US region, configure access through vpc-df7066ba; in the Tokyo region, through vpc-e630c182; and in the EU01 region, through vpc-f54e6a9e.

Identify the region of your TD Console from the URL you use to log in to TD, then refer to the data connector for that region.

Region of TD Console and URL:

  • US: https://console.treasuredata.com

  • Tokyo: https://console.treasuredata.co.jp

  • EU01: https://console.eu01.treasuredata.com

Use the TD Console to Create Your Connection

You can use TD Console to create your data connector.

Create a New Connection

When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

  1. Navigate to Integrations Hub > Catalog and search for AWS S3.

  2. Select Create Authentication.

  3. The New Authentication dialog opens. You need an access key ID and a secret access key to authenticate using credentials.

  4. Set the following parameters. Select Continue. Name your new AWS S3 connection. Select Done.



Endpoint

Authentication Method


basic

  • Uses access_key_id and secret_access_key to authenticate. See AWS Programmatic access.

    • Access Key ID

    • Secret access key

anonymous

  • Uses anonymous access. This auth method can access only public files.

session (Recommended)

  • Uses temporarily generated access_key_id, secret_access_key, and session_token. This authentication method is available only for data import; it cannot currently be used for data export.

    • Access Key ID

    • Secret access key

    • Session token

Access Key ID

  • Issued by AWS S3.

Secret Access Key

  • Issued by AWS S3.
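
If the same connection is described in a YAML load file rather than in TD Console, the authentication methods above might be expressed roughly as follows. This is a minimal sketch, not a confirmed reference: the access_key_id, secret_access_key, and session_token names come from this page, while the auth_method key and the in/type wrapper are assumptions based on Embulk S3 input plugin conventions. All credential values are placeholders.

```yaml
in:
  type: s3
  # basic: long-lived credentials issued by AWS
  auth_method: basic                         # assumed key name
  access_key_id: YOUR_ACCESS_KEY_ID          # placeholder
  secret_access_key: YOUR_SECRET_ACCESS_KEY  # placeholder

  # session: temporary credentials (import only); use instead of basic
  # auth_method: session
  # access_key_id: TEMPORARY_ACCESS_KEY_ID
  # secret_access_key: TEMPORARY_SECRET_ACCESS_KEY
  # session_token: TEMPORARY_SESSION_TOKEN

  # anonymous: no credentials, public files only
  # auth_method: anonymous
```

Only one authentication method applies per connection, which is why the alternatives are left commented out.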


Transfer Your AWS S3 Data to Treasure Data

After creating the authenticated connection, you are automatically taken to Authentications.

  1. Search for the connection you created. 

  2. Select New Source.


Connection

  1. Type a name for your Source in the Data Transfer field.

  2. Select Next.

Source Table

  1. The Source dialog opens. Edit the following parameters:


Parameters

Description

Bucket

  • provide the S3 bucket name (Ex. your_bucket_name)

Path Prefix

  • specify a prefix for target keys. (Ex. logs/data_)

Path Regex

  • use regexp to match file paths. If a file path doesn’t match the specified pattern, the file is skipped. For example, if you specify the pattern .csv$, then a file is skipped if its path doesn’t match the pattern. Read more about regular expressions.

Skip Glacier Objects

  • select to skip processing objects stored in the Amazon Glacier storage class. If objects are stored in Glacier storage class, but this option is not checked, an exception is thrown.

Filter by Modified Time

  • choose how to filter files for ingestion:

If it is unchecked (default):

  • Start after path: inserts the last_path parameter so that the first execution skips files before the path. (Ex. logs/data_20170101.csv)

  • Incremental: enables incremental loading. If incremental loading is enabled, config diff for the next execution includes the last_path parameter so that the next execution skips files before the path. Otherwise, last_path is not included.

If it is checked:

  • Modified after: inserts the last_modified_time parameter so that the first execution skips files that were modified before the specified timestamp. (Ex. 2019-06-03T10:30:19.806Z)

  • Incremental by Modified Time: enables incremental loading by modified time. If incremental loading is enabled, config diff for the next execution includes the last_modified_time parameter so that the next execution skips files that were modified before that time. Otherwise, last_modified_time is not included.
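
For reference, the parameters above correspond roughly to the following keys in a YAML load configuration. This is a sketch under assumptions: last_path is named on this page, but the other key names (bucket, path_prefix, path_match_pattern, skip_glacier_objects, incremental) are assumed from the Embulk S3 input plugin and may differ in your environment. The values are the examples given above, with the dot in the regex escaped so that it matches a literal ".".

```yaml
in:
  type: s3
  bucket: your_bucket_name            # Bucket
  path_prefix: logs/data_             # Path Prefix
  path_match_pattern: \.csv$          # Path Regex; non-matching paths are skipped
  skip_glacier_objects: true          # Skip Glacier Objects
  incremental: true                   # Incremental
  last_path: logs/data_20170101.csv   # Start after path
```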


You can limit access to your S3 bucket/IAM user by using a list of static IPs. Contact support@treasuredata.com if you need static IPs.

There are instances where you might need to scan all the files in a directory (such as from the top-level directory "/"). In such instances, you must use the CLI to do the import.

Example

Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can save CloudFront log files, shown as follows:

[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]

In this case, the Source Table settings are as shown:
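
One plausible set of values for these log files, written as a YAML sketch. The key names are assumed from the Embulk S3 input plugin rather than taken from this page, and the last_path value assumes that you want to import only the files from 2017-04-23-16 onward.

```yaml
in:
  type: s3
  bucket: your_bucket
  path_prefix: logging/
  path_match_pattern: \.gz$   # optional here, since every listed file ends in .gz
  # Start after the last 2017-04-23-15 file so that only later files are imported
  last_path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz
  incremental: true           # only needed if the job runs on a schedule
```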

The BZip2 decoder plugin is supported by default. See also Zip Decoder Function.

Data Settings

  1. Select Next.
    The Data Settings page opens.

  2. Optionally, edit the data settings or skip this page of the dialog.

Filters

Data Preview




Data Placement

Validating Your Data Connector Jobs

How do I troubleshoot data import problems?

Review the job log. Warnings and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.

What can I do if the data connector for S3 job is running for a long time?

Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, performance degrades. To mitigate this issue, reduce the number of files the job has to scan, for example by narrowing the Path Prefix.

Sample Workflow

There is a sample workflow file for the S3 import integration. You can define the import settings in a YAML file and run it using the `td_load>:` workflow operator. YAML file-based execution also lets you use variable definitions that are not available through the Source function of TD Console alone.

You can refer to the sample code at https://github.com/treasure-data/treasure-boxes/tree/master/td_load/s3.

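```yaml
timezone: UTC

# Run once a day at 02:00 UTC
schedule:
  daily>: 02:00:00

# Send a notification if the workflow has not finished by 08:00
sla:
  time: 08:00
  +notice:
    mail>: {data: Treasure Workflow Notification}
    subject: This workflow is taking long time to finish
    to: [me@example.com]

# Destination database and table, referenced below as td.dest_db and td.dest_table
_export:
  td:
    dest_db: dest_db_ganesh
    dest_table: dest_table_ganesh

# Create the destination database and table if they do not exist
+prepare_table:
  td_ddl>:
  create_databases: ["${td.dest_db}"]
  create_tables: ["${td.dest_table}"]
  database: ${td.dest_db}

# Load the S3 data using the settings in config/daily_load.yml
+load:
  td_load>: config/daily_load.yml
  database: ${td.dest_db}
  table: ${td.dest_table}
```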
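
The workflow loads config/daily_load.yml, which is not shown on this page. A minimal sketch of what such a file could contain follows. The bucket and prefix are the example values used earlier, the secret names (s3.access_key_id, s3.secret_access_key) are hypothetical, and the key names are assumed from the Embulk S3 input plugin. Parser settings (file format and columns) are omitted. See the linked sample repository for the actual file.

```yaml
in:
  type: s3
  # Credentials read from workflow secrets (hypothetical secret names)
  access_key_id: ${secret:s3.access_key_id}
  secret_access_key: ${secret:s3.secret_access_key}
  bucket: your_bucket_name
  path_prefix: logs/data_
out:
  mode: append   # add rows to the destination table on each run
```

Keeping credentials in workflow secrets rather than in the YAML file means the load configuration can be committed to version control without exposing keys.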