Learn more about Amazon S3 Export Integration.
The data connector for Amazon S3 enables you to import data from JSON, TSV, and CSV files stored in an S3 bucket.
For sample workflows that import data from files stored in an S3 bucket, go to the Treasure Box on GitHub.
An update to provide support for AssumeRole is coming in Spring 2022.
Prerequisites
You must have basic knowledge of Treasure Data.
If you are using an AWS S3 bucket located in the same region as your TD region, you must set up an access route in AWS by specifying the VPC. For example, in the US region, configure access through vpc-df7066ba; in the Tokyo region, through vpc-e630c182; and in the EU01 region, through vpc-f54e6a9e.
Determine your TD Console region from the URL you use to log in to TD, then refer to the data connector for your region.
| Region of TD Console | URL |
| --- | --- |
| US | |
| Tokyo | |
| EU01 | |
Use the TD Console to Create Your Connection
You can use TD Console to create your data connector.
Create a New Connection
When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.
Navigate to Integrations Hub > Catalog and search for AWS S3.
Select Create Authentication.
The New Authentication dialog opens. You need an Access Key ID and a Secret Access Key to authenticate using credentials.
Set the following parameters, then select Continue. Name your new AWS S3 connection and select Done.
| Parameters | Description |
| --- | --- |
| Endpoint | |
| Authentication Method | One of `basic` (authenticate with an access key and secret key), `anonymous` (no authentication), or `session` (authenticate with temporary session credentials). |
| Access Key ID | AWS S3 issued |
| Secret Access Key | AWS S3 issued |
Transfer Your AWS S3 Data to Treasure Data
After creating the authenticated connection, you are automatically taken to Authentications.
Search for the connection you created.
Select New Source.
Connection
Type a name for your Source in the Data Transfer field.
Click Next.
Source Table
The Source dialog opens. Edit the following parameters:
| Parameters | Description |
| --- | --- |
| Bucket | The name of the S3 bucket from which to import files. |
| Path Prefix | The prefix of the object keys to import. |
| Path Regex | A regular expression that object keys must match to be imported. |
| Skip Glacier Objects | Skip objects stored in the Amazon S3 Glacier storage class. |
| Filter by Modified Time | If it is unchecked (default), incremental loading is based on the path of the last imported file. If it is checked, incremental loading is based on each object's last modified time. |

You can limit access to your S3 bucket/IAM user by using a list of static IPs. Contact support@treasuredata.com if you need static IPs.
There are instances where you might need to scan all the files in a directory (such as from the top-level directory "/"). In such instances, you must use the CLI to do the import.
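If you have to scan from the top-level directory, a common approach is to describe the load in a small config file and run it with the TD Toolbelt. The following is a minimal sketch, assuming basic authentication; the bucket name and credentials are placeholders:

```yaml
# seed.yml -- illustrative starting point for a CLI-based import
in:
  type: s3
  auth_method: basic
  access_key_id: <your access key id>
  secret_access_key: <your secret access key>
  bucket: your_bucket
  path_prefix: /            # scan from the top-level directory
out:
  mode: append
```

From here, commands such as `td connector:guess seed.yml -o load.yml`, `td connector:preview load.yml`, and `td connector:issue load.yml --database your_db --table your_table` can generate, check, and run the load; exact flags may vary by TD Toolbelt version.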
Example
Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can save CloudFront log files, shown as follows:
```
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]
```
In this case, the Source Table settings are as shown:
Bucket: your_bucket
Path Prefix: logging/
Path Regex: .gz$ (Not Required)
Start after path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz (Assuming that you want to import the log files from 2017-04-23-16.)
Incremental: true (if you want to schedule this job.)
The BZip2 decoder plugin is supported by default. For other compression formats, see Zip Decoder Function.
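For reference, the same Source Table settings expressed as the `in:` section of a config file might look like the following sketch. The `path_match_pattern` and `last_path` option names are assumptions based on the S3 input plugin; verify them against your plugin version:

```yaml
in:
  type: s3
  bucket: your_bucket
  path_prefix: logging/
  path_match_pattern: .gz$                                      # Path Regex
  last_path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz   # Start after path
  incremental: true                                             # enables scheduled, incremental loading
```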
Data Settings
Select Next.
The Data Settings page opens. Optionally, edit the data settings or skip this page of the dialog.
Filters
Import Integration Filters enable you to modify your imported data after you have completed Editing Data Settings for your import.

To apply import integration filters:

1. Select Next in Data Settings. The Filters dialog opens.
2. Select the filter option you want to add.
3. Select Add Filter. The parameter dialog for that filter opens.
4. Edit the parameters. For information on each filter type, see one of the following:
   - Retaining Columns Filter
   - Adding Columns Filter
   - Dropping Columns Filter
   - Expanding JSON Filter
   - Digesting Filter
5. Optionally, to add another filter of the same type, select Add within the specific column filter dialog.
6. Optionally, to add another filter of a different type, select the filter option from the list and repeat the same steps.
7. After you have added the filters you want, select Next. The Data Preview dialog opens.
Data Preview
You can see a preview of your data before running the import. Data shown in the data preview is approximated from your source; it is not the actual data that will be imported.

1. To preview your data, select Generate Preview.
2. Verify that the data looks approximately as you expect it to.
3. Select Next.
Data preview is optional and you can safely skip to the next page of the dialog if you want.
Data Placement
For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

1. Select Next. Under Storage, create a new database or select an existing one, and create a new table or select an existing one for where you want to place the imported data.
2. Select a Database > select an existing database or Create New Database. Optionally, type a database name.
3. Select a Table > select an existing table or Create New Table. Optionally, type a table name.
4. Choose the method for importing the data:
   - Append (default): data import results are appended to the table.
   - Always Replace: replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
   - Replace on New Data: replaces the entire content of an existing table with the result output only when there is new data.
5. Select the Timestamp-based Partition Key column.
6. Select the Timezone for your data storage.
7. Under Schedule, choose when and how often you want to run this query:
   - To run the transfer once: select Off, select the Scheduling Timezone, and select Create & Run Now.
   - To repeat the query: select On, then select the Schedule. The UI provides four options: @hourly, @daily, @monthly, or custom cron. You can also select Delay Transfer and add a delay of execution time. Select the Scheduling Timezone, then select Create & Run Now.

After your transfer has run, you can see the results of your transfer in Data Workbench > Databases.
If the table does not exist, it will be created.
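If you run the same import from a workflow rather than the TD Console, the scheduling choices above map to the workflow's schedule block. The following is a minimal, hypothetical fragment assuming a custom cron that runs daily at 02:00; the database and table names are placeholders:

```yaml
timezone: UTC

schedule:
  cron>: "0 2 * * *"          # custom cron: every day at 02:00 in the workflow timezone

+load:
  td_load>: config/daily_load.yml
  database: my_database       # hypothetical target database
  table: my_table             # hypothetical target table
```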
If you want to set a different partition key seed than the default key, you can specify a long or timestamp column as the partitioning time. By default, the connector uses upload_time as the time column, via the add_time filter.
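As a rough illustration of overriding the default time column, the add_time filter might be configured in a load config as follows; the created_at column is hypothetical, and the option names follow the add_time filter plugin:

```yaml
filters:
  - type: add_time
    to_column:
      name: time            # column used as the time-based partition key
    from_column:
      name: created_at      # hypothetical existing timestamp column in the source data
```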
Validating Your Data Connector Jobs
How do I troubleshoot data import problems?
Review the job log. Warnings and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.
To find out more about a specific job, you can select that job and see details. Depending on the type of job, you can see some or all of the following: results, query, output logs, engine logs, details, and destination.

1. Open the TD Console.
2. Navigate to Jobs. The number of jobs is listed in the upper right of the page.
3. Optionally, use filters to reduce the listing of jobs and locate the ones you are interested in, including filtering by job owner, date, and database name.
4. Select a job to open it and view results, query definition, logs, and other details. Each tab has different information about the job:
   - Results: view the imported data from the job. From here you can copy the results to the clipboard or download them as a CSV file.
   - Query: view the query syntax of the job, launch a query editor, copy queries to create new queries or workflows, and refine queries to improve efficiency.
   - Output and Engine Logs: log information can be reviewed for run times, query result numbers, and error codes, and can be copied to the clipboard.
   - Details: view further details such as query name, type, job ID, status, duration, scheduled and actual times, result count and size, runner, database queried, and priority.
   - Destination: view details of an export integration configuration, such as integration type and settings.
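If you prefer the command line, the TD Toolbelt can also be used to review jobs; for example, `td job:list` shows recent jobs and `td job:show <job_id>` shows the status and log of a specific job (command names are from the standard TD Toolbelt and may vary by version).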
What can I do if the data connector for S3 job is running for a long time?
Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, performance degrades. To mitigate this issue, you can (see the sketch below):

- Narrow the path_prefix option to reduce the count of S3 files.
- Set the min_task_size option to 268,435,456 (256 MB).
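The following is a minimal sketch of how these two mitigations might look in the `in:` section of a load config; the bucket name and prefix are placeholders, and the option layout assumes the S3 input plugin described above:

```yaml
in:
  type: s3
  bucket: your_bucket
  path_prefix: logging/2017-04-23/   # narrower prefix, so fewer objects are listed
  min_task_size: 268435456           # 256 MB per task, grouping many small files into each task
```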
Sample Workflow
A sample workflow file is available for the S3 import integration. You define the import settings in a YAML file and run it using the `td_load>:` workflow operator. Variable definitions that are not possible with the Source function of the TD Console alone can be used with YAML file-based execution.
You can refer to the sample code at https://github.com/treasure-data/treasure-boxes/tree/master/td_load/s3.
```yaml
timezone: UTC

schedule:
  daily>: 02:00:00

sla:
  time: 08:00
  +notice:
    mail>: {data: Treasure Workflow Notification}
    subject: This workflow is taking long time to finish
    to: [me@example.com]

_export:
  td:
    dest_db: dest_db_ganesh
    dest_table: dest_table_ganesh

+prepare_table:
  td_ddl>:
  create_databases: ["${td.dest_db}"]
  create_tables: ["${td.dest_table}"]
  database: ${td.dest_db}

+load:
  td_load>: config/daily_load.yml
  database: ${td.dest_db}
  table: ${td.dest_table}
```
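The workflow above loads its import settings from config/daily_load.yml, which is not shown on this page. The following is a minimal, hypothetical sketch of such a file for this integration, assuming basic authentication and the CloudFront-style bucket layout from the earlier example:

```yaml
# config/daily_load.yml -- illustrative sketch only
in:
  type: s3
  auth_method: basic
  access_key_id: <your access key id>        # replace, or manage credentials as workflow secrets
  secret_access_key: <your secret access key>
  bucket: your_bucket
  path_prefix: logging/
  incremental: true
  decoders:
    - {type: gzip}                           # the example log files are gzip-compressed
  # parser settings (columns, delimiter, and so on) are usually produced by td connector:guess
out:
  mode: append
```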