The import data connector for Amazon S3 enables you to import the data from JSON, TSV, and CSV files stored in S3 buckets. The key difference and benefit of Amazon S3 Import Integration v2 over v1 is the added support for assume_role authentication.
Review the information in the following table to understand the authentication differences between v2 and v1. For v1 details, see Amazon S3 Import Integration v1.
| Authentication Method | Amazon S3 v2 | Amazon S3 v1 |
|---|---|---|
| basic | x | x |
| anonymous | x | |
| session | x | x |
| assume_role | x | |
- A basic knowledge of Treasure Data.
If your AWS S3 bucket is located in the same region as your TD region, the IP address from which TD accesses the bucket is private and changes dynamically. To restrict access in this case, specify the VPC ID instead of static IP addresses. For example, in the US region, configure access through vpc-df7066ba; in the Tokyo region, through vpc-e630c182; and in the EU01 region, through vpc-f54e6a9e.
To identify your TD Console region, check the URL you use to log in to TD, then refer to the data connector for the region shown in that URL.
See the API Documentation for details.
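As a rough sketch of the VPC-based restriction described above, the following Python snippet builds an S3 bucket policy that rejects reads unless they arrive from the TD VPC for a given region. The aws:SourceVpc condition key is a standard AWS condition key, but the overall policy shape (a Deny statement with StringNotEquals), the statement ID, and the function name are illustrative assumptions, not a policy published by Treasure Data. A Deny of this kind also blocks every other access path to the bucket, so review it against your own access needs before applying anything similar.

```python
import json

# VPC IDs quoted in the text above, keyed by TD region.
TD_VPC_IDS = {
    "US": "vpc-df7066ba",
    "Tokyo": "vpc-e630c182",
    "EU01": "vpc-f54e6a9e",
}


def vpc_restricted_policy(bucket, td_region):
    """Build a bucket policy that denies reads from outside the TD VPC."""
    vpc_id = TD_VPC_IDS[td_region]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyReadsOutsideTdVpc",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Assumption: requests reach S3 through a VPC endpoint,
                # which is when aws:SourceVpc is populated.
                "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(vpc_restricted_policy("your-bucket", "US"), indent=2))
```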
If your security policy requires IP whitelisting, you must add Treasure Data's IP addresses to your allowlist to ensure a successful connection.
Please find the complete list of static IP addresses, organized by region, at the following link:
https://api-docs.treasuredata.com/en/overview/ip-addresses-integrations-result-workers/
If you are importing a very large file, you can take advantage of the parallel import support provided by this integration. To do this, break up your large files into smaller files and then upload the smaller files simultaneously in batches. However, bear in mind that attempting to import lots of very small files will have a negative effect on performance. Consequently, Treasure Data recommends that you do not perform parallel input with file sizes smaller than 50MB. The default maximum number of parallel import threads that can be used is 16.
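The sizing guidance above can be sketched as a small planning helper. This is a minimal illustration, assuming that "50MB" means 50 × 1024 × 1024 bytes and that you want at most one file per import thread; the function name and the cap on the number of files are hypothetical choices, not connector behavior.

```python
MB = 1024 * 1024


def plan_chunks(total_bytes, min_chunk_bytes=50 * MB, max_parallel=16):
    """Return (file_count, bytes_per_file) for splitting one large file.

    Keeps every piece at or above min_chunk_bytes and never plans more
    pieces than max_parallel (the default thread limit quoted above).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Largest number of pieces that still keeps each piece >= min_chunk_bytes.
    largest_allowed = max(1, total_bytes // min_chunk_bytes)
    count = min(max_parallel, largest_allowed)
    # Ceiling division so the pieces cover the whole file.
    chunk_size = -(-total_bytes // count)
    return count, chunk_size


if __name__ == "__main__":
    for size in (30 * MB, 120 * MB, 1024 * MB):
        count, chunk = plan_chunks(size)
        print(f"{size // MB} MB -> {count} file(s) of about {chunk // MB} MB")
```

A file smaller than twice the minimum size is left whole, because splitting it would produce pieces below the recommended threshold.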
You can use TD Console to create your data connector.
- Open TD Console.
- Navigate to Integrations Hub > Catalog.
- Search for S3 v2 and select Amazon S3 (v2).
- Select Create Authentication.

A new Authentication dialog opens. Depending on the Authentication method you choose, the dialog may look like one of these screens:



- Configure the authentication fields, and then select Continue.
The following table describes the authentication configuration parameters for Amazon S3 Import Integration v2.
| Parameter | Description |
|---|---|
| Endpoint | S3 service endpoint override. You can find region and endpoint information in the AWS service endpoints document (for example, s3.ap-northeast-1.amazonaws.com). When specified, it overrides the region setting. |
| Region | AWS Region |
| Authentication Method | One of basic, anonymous, session, or assume_role |
| Access Key ID | AWS S3 issued |
| Secret Access Key | AWS S3 issued |
| Secret token | Session token for temporary credentials |
| TD's Instance Profile | This value is provided by the TD Console. The numeric portion of the value constitutes the Account ID that you will use when you create your IAM role. |
| Account ID | Your AWS Account ID |
| Your Role Name | Your AWS Role Name |
| External ID | Your Secret External ID |
| Duration In Seconds | Duration For The Temporary Credentials |
- Name your new AWS S3 connection, and select Done.
- Create a new authentication with the assume_role authentication method.
- Make a note of the numeric portion of the value in the TD's Instance Profile field.

- Create your AWS IAM role.
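When you create the IAM role, it needs a trust policy that lets the TD account assume it. The following Python sketch shows one common AWS pattern for such a policy. The text above only says that the numeric portion of TD's Instance Profile is the Account ID to use; trusting the account root principal, requiring sts:ExternalId, the sample instance profile string, and both function names are assumptions made for illustration.

```python
import json
import re


def account_id_from_instance_profile(value):
    """Extract the numeric portion of the TD's Instance Profile value."""
    match = re.search(r"\d+", value)
    if match is None:
        raise ValueError("no numeric portion found in instance profile value")
    return match.group(0)


def trust_policy(td_account_id, external_id):
    """Build a trust policy allowing the given account to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: the whole TD account is trusted as principal.
                "Principal": {"AWS": f"arn:aws:iam::{td_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The External ID must match the value entered in TD Console.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical instance profile value; copy the real one from TD Console.
    account_id = account_id_from_instance_profile("example-profile/123456789012")
    print(json.dumps(trust_policy(account_id, "my-external-id"), indent=2))
```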


After creating the authenticated connection, you are automatically taken to Authentications.
- Search for the connection you created.
- Select New Source.

- Type a name for your Source in the Data Transfer field.
- Click Next.

The Source dialog opens.
- Edit the following parameters.

| Parameters | Description |
|---|---|
| Bucket | |
| Path Prefix | |
| Path Regex | |
| Skip Glacier Objects | |
| Filter by Modified Time | |
| Unchecked (default): | |
| Checked: | |
You might need to scan all the files in a directory (such as from the top-level directory "/"). In such instances, you must use the CLI to do the import.
Example
Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can configure CloudFront to save log files as shown here:
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]
In this case, the Source Table settings are as shown:
- Bucket: your_bucket
- Path Prefix: logging/
- Path Regex: \.gz$ (not required)
- Start after path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz (assuming that you want to import the log files from 2017-04-23-16)
- Incremental: true (if you want to schedule this job)
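To see how these settings interact, here is a small local simulation in Python. It is a sketch only: it assumes the regex is applied as a search over the full object key and that "start after" means strictly greater in lexicographic order, which mirrors how S3 lists keys. The function name is hypothetical and this is not the connector's actual implementation.

```python
import re


def select_keys(keys, path_prefix, path_regex=None, start_after=None):
    """Return the object keys that the settings above would pick up."""
    pattern = re.compile(path_regex) if path_regex else None
    selected = []
    for key in sorted(keys):
        if not key.startswith(path_prefix):
            continue  # outside the Path Prefix
        if pattern and not pattern.search(key):
            continue  # does not match the Path Regex
        if start_after is not None and key <= start_after:
            continue  # at or before the Start after path
        selected.append(key)
    return selected


if __name__ == "__main__":
    keys = [
        "logging/E231A697YXWD39.2017-04-23-15.a103fd5a.gz",
        "logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz",
        "logging/E231A697YXWD39.2017-04-23-16.594fa8e6.gz",
        "logging/E231A697YXWD39.2017-04-23-16.d12f42f9.gz",
    ]
    for key in select_keys(
        keys,
        path_prefix="logging/",
        path_regex=r"\.gz$",
        start_after="logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz",
    ):
        print(key)
```

With the four CloudFront log files from the example, only the two files from hour 16 are selected.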
BZip2 decoder plugin is supported as the default. See File Decoder Function.
Select Next.
The Data Settings page opens.
Optionally, edit the data settings or skip this page of the dialog.

Filters are available in the Create Source or Edit Source import settings for your S3, FTP, or SFTP connectors.
Import Integration Filters enable you to modify your imported data after you have completed Editing Data Settings for your import.
To apply import integration filters:
- Select Next in Data Settings. The Filters dialog opens.
- Select the filter option you want to add.

- Select Add Filter. The parameter dialog for that filter opens.
- Edit the parameters. For information on each filter type, see one of the following:
- Retaining Columns Filter
- Adding Columns Filter
- Dropping Columns Filter
- Expanding JSON Filter
- Digesting Filter
- Optionally, to add another filter of the same type, select Add within the specific column filter dialog.
- Optionally, to add another filter of a different type, select the filter option from the list and repeat the same steps.
- After you have added the filters you want, select **Next**. The Data Preview dialog opens.
You can see a preview of your data before running the import by selecting Generate Preview. Data preview is optional and you can safely skip to the next page of the dialog if you choose to.
- Select Next. The Data Preview page opens.
- If you want to preview your data, select Generate Preview.
- Verify the data.
For data placement, select the target database and table where you want your data placed and indicate how often the import should run.
Select Next. Under Storage, create a new database or select an existing one, and create a new table or select an existing one, to hold the imported data.
Select a Database > Select an existing or Create New Database.
Optionally, type a database name.
Select a Table > Select an existing or Create New Table.
Optionally, type a table name.
Choose the method for importing the data.
- Append (default): Data import results are appended to the table. If the table does not exist, it is created.
- Always Replace: Replaces the entire content of an existing table with the imported data. If the table does not exist, a new table is created.
- Replace on New Data: Replaces the entire content of an existing table with the imported data only when there is new data.
Select the Timestamp-based Partition Key column. To use a partition key seed other than the default, specify a long or timestamp column as the partitioning time. By default, the time column is upload_time, set by the add_time filter.
Select the Timezone for your data storage.
Under Schedule, you can choose when and how often you want to run this query.
- Select Off.
- Select Scheduling Timezone.
- Select Create & Run Now.
- Select On.
- Select the Schedule. The UI provides these four options: @hourly, @daily, @monthly, or custom cron.
- You can also select Delay Transfer and add a delay of execution time.
- Select Scheduling Timezone.
- Select Create & Run Now.
After your transfer has run, you can see the results of your transfer in Data Workbench > Databases.
The key difference and benefit of Amazon S3 Import Integration v2 over v1 is the added support for assume_role authentication. With assume_role as the authentication method, you cannot declare the authentication explicitly. Refer to Reuse the existing Authentication for a workflow configuration that reuses an authentication.
A workflow can start a job with a unique id. For more information, see https://docs.digdag.io/operators/td_load.html.
Optionally, you can use the TD Toolbelt to configure the connection, create the job, and schedule job execution.
Before setting up the connector, install the most current TD Toolbelt.
If you plan to use incremental loading with the CLI and a YAML file, you must use a pre-existing source connector in TD Console, because the incremental load functionality persists information about the last record processed in the console.
Configure the seed.yml file as shown in the following example with your AWS access keys. You must also specify the bucket name and source file name. Optionally, you can specify path_prefix to match multiple files. For example, path_prefix: path/to/sample_ matches:
path/to/sample_201501.csv.gz
path/to/sample_201502.csv.gz
path/to/sample_201505.csv.gz
and so on. The path_prefix: path/to/sample_file used in the example below matches only object keys that begin with path/to/sample_file.
Using path_prefix with a leading '/' can lead to unintended results. For example, "path_prefix: /path/to/sample_file" causes the plugin to look for the file in s3://sample_bucket//path/to/sample_file, which on S3 is different from the intended path of s3://sample_bucket/path/to/sample_file.
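The leading-slash pitfall is easy to catch before a job is submitted. The following Python sketch shows how the bucket and prefix combine into a location and rejects a prefix that starts with '/'. Both helper names are hypothetical; they are a local pre-flight check, not part of the TD Toolbelt.

```python
def resolved_uri(bucket, path_prefix):
    """Show the S3 location that a bucket and path_prefix combine into."""
    return "s3://" + bucket + "/" + path_prefix


def validate_path_prefix(path_prefix):
    """Reject a path_prefix that would produce a double slash after the bucket."""
    if path_prefix.startswith("/"):
        raise ValueError(
            "path_prefix must not start with '/': " + repr(path_prefix)
        )
    return path_prefix


if __name__ == "__main__":
    print(resolved_uri("sample_bucket", "path/to/sample_file"))
    print(resolved_uri("sample_bucket", "/path/to/sample_file"))
    try:
        validate_path_prefix("/path/to/sample_file")
    except ValueError as error:
        print("rejected:", error)
```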
in:
  type: s3_v2
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  # path to the *.json or *.csv or *.tsv file on your s3 bucket
  path_prefix: path/to/sample_file
  path_match_pattern: \.csv$ # a file will be skipped if its path doesn't match with this pattern
  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$ # match files whose suffix is .csv or .csv.gz
out:
  mode: append
If you reuse an existing authentication, set the td_authentication_id config key to the Authentication ID. This is required for the assume_role authentication method. See Reuse the existing Authentication.
connector:guess automatically reads the source files and assesses the file format and the fields and columns.
td connector:guess seed.yml -o load.yml
If you look at the load.yml file, you can see the "guessed" file format definitions, including file formats, encodings, column names, and types.
in:
  type: s3_v2
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  path_prefix: path/to/sample_file
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    skip_header_lines: 1
    columns:
    - name: id
      type: long
    - name: company
      type: string
    - name: customer
      type: string
    - name: created_at
      type: timestamp
      format: '%Y-%m-%d %H:%M:%S'
out:
  mode: append
You can see a preview of the data using the td connector:preview command.
td connector:preview load.yml
The connector:guess command needs more than three rows and two columns in the source data file because it assesses the column definitions using sample rows from the source data.
If the system detects your column name or column type unexpectedly, modify load.yml directly and preview again.
Currently, the Data Connector supports parsing of the "boolean", "long", "double", "string", and "timestamp" types.
Submit the load job. It may take a couple of hours, depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.
It's also recommended to specify the --time-column option because Treasure Data's storage is partitioned by time (see data partitioning). If the option is not provided, the data connector chooses the first long or timestamp column as the partitioning time. The column specified by --time-column must be of type long or timestamp.
If your data doesn't have a time column you can add a time column by using add_time filter option. For more details see add_time filter plugin.
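The default rule described above (the first long or timestamp column wins) can be surprising. The following Python sketch applies that rule to the column list from the guessed load.yml. The function name is hypothetical and the logic is a plain reading of the sentence above, not the connector's source code.

```python
def default_time_column(columns):
    """Return the first column of type long or timestamp, or None."""
    for column in columns:
        if column["type"] in ("long", "timestamp"):
            return column["name"]
    return None


# Columns as they appear in the guessed load.yml shown earlier.
COLUMNS = [
    {"name": "id", "type": "long"},
    {"name": "company", "type": "string"},
    {"name": "customer", "type": "string"},
    {"name": "created_at", "type": "timestamp"},
]

if __name__ == "__main__":
    print(default_time_column(COLUMNS))
```

Under this reading, the sample schema would be partitioned on id rather than created_at, which is why passing --time-column explicitly is the safer choice.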
$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
  --time-column created_at
The connector:issue command above assumes that you have already created a database (td_sample_db) and a table (td_sample_table). If the database or the table does not exist in TD, this command fails. Create the database and table manually, or use the --auto-create-table option with the td connector:issue command to auto-create the database and table:
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table
The data connector does not sort records on the server side. To use time-based partitioning effectively, sort records in files beforehand.
If you have a field called time, you don't have to specify the --time-column option.
td connector:issue load.yml --database td_sample_db --table td_sample_table
You can specify file import mode in the out section of the load.yml file.
The out: section controls how data is imported into a Treasure Data table.
For example, you may choose to append data or replace data in an existing table in Treasure Data.
| Mode | Description | Example (out section) |
|---|---|---|
| Append | Records are appended to the target table. | out: {mode: append} |
| Always Replace | Replaces data in the target table. Any manual schema changes made to the target table remain intact. | out: {mode: replace} |
| Replace on new data | Replaces data in the target table only when there is new data to import. | out: {mode: replace_on_new_data} |
You can schedule periodic data connector execution for incremental file import. We configure our scheduler carefully to ensure high availability.
For a scheduled import, the connector imports all files that match the specified prefix and tracks its progress with one of the following values, depending on the use_modified_time setting:
- If use_modified_time is disabled, the last path is saved for the next execution. On the second and subsequent runs, the connector only imports files that come after the last path in alphabetical order.
- Otherwise, the time that the job is executed is saved for the next execution. On the second and subsequent runs, the connector only imports files that were modified after that execution time in alphabetical order.
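The two bookkeeping modes above behave differently when a late file arrives with a name that sorts before files already imported. The following Python simulation illustrates this. The state keys (last_path, last_run_time) and the function name are hypothetical; the real connector persists its state on the TD side.

```python
from datetime import datetime, timezone


def files_for_run(objects, state, use_modified_time, run_time):
    """Pick files for one scheduled run and return (picked, new_state).

    objects is a list of (key, last_modified) tuples.
    """
    if use_modified_time:
        since = state.get("last_run_time")
        picked = sorted(
            key for key, modified in objects if since is None or modified > since
        )
        # The execution time is saved for the next run.
        return picked, {"last_run_time": run_time}
    last_path = state.get("last_path")
    picked = sorted(
        key for key, _ in objects if last_path is None or key > last_path
    )
    # The last imported path is saved for the next run.
    return picked, {"last_path": picked[-1] if picked else last_path}


if __name__ == "__main__":
    utc = timezone.utc
    first = [
        ("logs/b.csv", datetime(2024, 1, 1, tzinfo=utc)),
        ("logs/c.csv", datetime(2024, 1, 2, tzinfo=utc)),
    ]
    late = first + [
        ("logs/a.csv", datetime(2024, 1, 4, tzinfo=utc)),
        ("logs/d.csv", datetime(2024, 1, 4, tzinfo=utc)),
    ]
    for mode in (False, True):
        _, state = files_for_run(first, {}, mode, datetime(2024, 1, 3, tzinfo=utc))
        picked, _ = files_for_run(late, state, mode, datetime(2024, 1, 5, tzinfo=utc))
        print("use_modified_time =", mode, "->", picked)
```

In path mode the late file logs/a.csv is skipped because it sorts before the saved last path, while modified-time mode picks it up.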
A new schedule can be created using the td connector:create command.
$ td connector:create daily_import "10 0 * * *" \
  td_sample_db td_sample_table load.yml
It's also recommended to specify the --time-column option, because Treasure Data's storage is partitioned by time (see also data partitioning).
$ td connector:create daily_import "10 0 * * *" \
  td_sample_db td_sample_table load.yml \
  --time-column created_at
The cron parameter also accepts three special options: @hourly, @daily, and @monthly.
By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. The --timezone option supports only extended timezone formats such as 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations such as PST and CST are *not* supported and may lead to unexpected schedules.
You can see the list of currently scheduled entries by running the command td connector:list.
$ td connector:list
td connector:show daily_import
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job jobid.
td connector:history
td connector:delete removes the schedule.
td connector:delete daily_import
The IAM credentials specified in the YML configuration file, which are used for the connector:guess and connector:issue commands, need to have permissions for the AWS S3 resources that they need to access. If the IAM user does not have these permissions, configure the user with one of the predefined Policy Definitions or create a new Policy Definition in JSON format.
The following example is based on the Policy Definition reference format. It gives the IAM user read only permissions (through GetObject and ListBucket actions) to "your-bucket."
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
Replace "your-bucket" with the actual name of your S3 bucket.
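Before running connector:guess, it can help to confirm that a policy document really covers both the bucket (for s3:ListBucket) and its objects (for s3:GetObject). The Python sketch below performs that check on a policy loaded as a dictionary. It is deliberately simple: it compares actions and resources by exact string match and ignores wildcards, conditions, and Deny statements. The function names are hypothetical.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def missing_read_grants(policy, bucket):
    """List the read permissions that the policy does not explicitly allow."""
    needed = {
        ("s3:ListBucket", f"arn:aws:s3:::{bucket}"),
        ("s3:GetObject", f"arn:aws:s3:::{bucket}/*"),
    }
    granted = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            for resource in _as_list(statement.get("Resource", [])):
                granted.add((action, resource))
    return sorted(
        f"{action} on {resource}" for action, resource in needed - granted
    )


if __name__ == "__main__":
    bucket_only = {
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": "arn:aws:s3:::your-bucket",
            }
        ]
    }
    print(missing_read_grants(bucket_only, "your-bucket"))
```

A policy that names only the bucket ARN is reported as missing s3:GetObject on the object ARN, which is a common cause of access-denied errors during import.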
In certain cases, IAM basic authentication through access_key_id and secret_access_key might be too risky (even though the secret_access_key is never clearly shown when a job is executed or after a session is created).
The S3 data connector can use AWS Security Token Service (STS) to provide Temporary Security Credentials. Using AWS STS, any IAM user can use their own access_key_id and secret_access_key to create these temporary keys with specific expiration times:
- new_access_key_id
- new_secret_access_key
- session_token keys
The following are types of Temporary Security Credentials:
- Session Token: The simplest Security Credentials with a specified expiration time. The temporary credentials have the same access as the IAM user that generated them. These credentials are valid as long as they have not expired and the permissions of the original IAM user have not changed.
- Federation Token: This adds an extra layer of permission control over the Session Token above. When generating a Federation Token, the IAM user is required to specify a Permission Policy definition. The scope can be used to restrict which resources the bearer of the Federation Token can access (which can be less than the access of the IAM user granting the permission). Any Permission Policy definition can be used, but the scope of the permissions is limited to the same permissions as, or a subset of the permissions of, the IAM user who generated the token. As with the Session Token, the Federation Token credentials are valid as long as they have not expired and the permissions associated with the original IAM credentials have not changed.
AWS STS Temporary Security Credentials can be generated using the AWS CLI or the AWS SDK in the language of your choice.
aws sts get-session-token --duration-seconds 900
In the following example:
- temp_creds is the name of the Federated token or the user's temp credentials.
- bucketname is the name of the S3 bucket being granted access. (Refer to the ARN specification for more details.)
- s3:GetObject and s3:ListBucket are the basic read operations for an AWS S3 bucket.
aws sts get-federation-token --name temp_creds --duration-seconds 900 \
  --policy '{"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": ["arn:aws:s3:::bucketname", "arn:aws:s3:::bucketname/*"]}]}'
AWS STS credentials cannot be revoked. They remain effective until they expire, or until you delete or remove the permissions of the original IAM user used to generate the credentials.
When your Temporary Security Credentials are generated, include the SecretAccessKey, AccessKeyId, and SessionToken in your seed.yml file and execute the Data Connector for S3 as usual.
in:
  type: s3_v2
  auth_method: session
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  session_token: ZZZZZZZZZZ
  bucket: sample_bucket
  path_prefix: path/to/sample_file
Because STS credentials expire after a specified amount of time, the data connector job that uses the credentials might eventually start failing. Currently, if the STS credentials are reported as expired, the data connector job retries up to the maximum number of times (5) and eventually completes with a status of "error."
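Because expired credentials only surface as a failed job after several retries, it can be useful to check the remaining lifetime before submitting. The following Python sketch takes the JSON printed by the aws sts commands above, renders the in: section of a seed file for session authentication, and reports how many seconds remain before expiry. The function names are hypothetical, and the sketch assumes the standard Credentials object (AccessKeyId, SecretAccessKey, SessionToken, Expiration) in the AWS CLI output.

```python
import json
from datetime import datetime, timezone


def _credentials(sts_output):
    """Pull the Credentials object out of the AWS CLI JSON output."""
    return json.loads(sts_output)["Credentials"]


def seed_in_section(sts_output, bucket, path_prefix):
    """Render the in: section of seed.yml for session authentication."""
    creds = _credentials(sts_output)
    fields = [
        ("type", "s3_v2"),
        ("auth_method", "session"),
        ("access_key_id", creds["AccessKeyId"]),
        ("secret_access_key", creds["SecretAccessKey"]),
        ("session_token", creds["SessionToken"]),
        ("bucket", bucket),
        ("path_prefix", path_prefix),
    ]
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    return "in:\n" + "".join(f"  {key}: {json.dumps(value)}\n" for key, value in fields)


def seconds_until_expiry(sts_output, now=None):
    """Return the number of seconds left before the credentials expire."""
    raw = _credentials(sts_output)["Expiration"]
    # Older Python versions cannot parse a trailing "Z", so normalize it.
    expires = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds()


if __name__ == "__main__":
    sample = json.dumps({
        "Credentials": {
            "AccessKeyId": "XXXXXXXXXX",
            "SecretAccessKey": "YYYYYYYYYY",
            "SessionToken": "ZZZZZZZZZZ",
            "Expiration": "2024-01-01T00:15:00Z",
        }
    })
    print(seed_in_section(sample, "sample_bucket", "path/to/sample_file"))
```

If the remaining lifetime is shorter than the expected import time, generate fresh credentials before issuing the job.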
To confirm the import, see the steps in Validating Your Data Connector Jobs.
This feature allows you to reuse an existing authentication defined in the TD Console UI.
Follow the steps in Importing from AWS S3 using TD Console to create an authentication.
Navigate to the Integrations Hub > Authentications screen.
Click the saved Authentication.
The Authentication ID is the number shown in the browser URL.

Use the config key td_authentication_id with the Authentication ID above to create configurations for TD Workflow or CLI (Toolbelt).
Example of configurations with an Authentication reuse
+import_from_s3_assume_role_with_existing_connection:
  td_load>: cfg_load.yml
  database: test_db
  table: test_tbl
## cfg_load.yml
in:
  type: s3_v2
  bucket: sample_bucket
  path_prefix: path/to/sample_file
  td_authentication_id: 287355
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ","
    quote: "\""
    escape: "\""
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - name: col_1
      type: string
    - name: col_2
      type: string
Example seed config (seed.yml)
in:
  type: s3_v2
  td_authentication_id: 287355
  bucket: sample_bucket
  path_prefix: path/to/sample_file