Page tree
Skip to end of metadata
Go to start of metadata

Learn more about Amazon S3 Export Integration.

The data connector for Amazon S3 enables you to import the data from your JSON, TSV, and CSV files stored in an S3 bucket.

For sample workflows on importing data from files stored in an S3 bucket, go to the Treasure Box on Github.

Prerequisites

You must have basic knowledge of Treasure Data.

You must set up an access route in AWS if you are using an AWS S3 bucket located in the same region as your TD region. You set up the access route by specifying the VPC. For example, if in the US region, configure access through vpc-df7066ba. If in the Tokyo region, configure access through vpc-e630c182 and, for the EU01 region, vpc-f54e6a9e.

Look up the region of TD Console by the URL you are logging in to TD, then refer to the data connector of your region in the URL.

Use the TD Console to Create Your Connection

You can use TD Console to create your data connector.

Create a New Connection

When you configure a data connection, you provide authentication to access the integration. In Treasure Data, you configure the authentication and then specify the source information.

  1. Navigate to Integrations Hub > Catalog and search for AWS S3.


  2. Select Create.

  3. New Authentication dialog opens. You need a client ID and access keys to authenticate using credentials.


  4. Set the following parameters. Select Continue. Name your new AWS S3 connection. Select Done.



Endpoint

Authentication Method


basic

  • Uses access_key_id and secret_access_key to authenticate. See AWS Programmatic access.

    • Access Key ID

    • Secret access key

anonymous

  • Uses anonymous access. This auth method can access only public files.

session

  • Uses temporary-generated access_key_id, secret_access_key and session_token. (This authentication method is only available with data import. This can't be used with data export for now.)

    • Access Key ID

    • Secret access key

    • Secret token

Access Key ID

AWS S3 issued

Secret Access Key

AWS S3 issued


Transfer Your AWS S3 Data to Treasure Data

After creating the authenticated connection, you are automatically taken to Authentications.

  1. Search for the connection you created. 

  2. Select New Source.


Connection

  1. Type a name for your Source in the Data Transfer field.

  2. Click Next

Source Table

  1. The Source dialog opens. Edit the following parameters


Parameters

Description

Bucket

  • provide the S3 bucket name (Ex. your_bucket_name)

Path Prefix

  • specify a prefix for target keys. (Ex. logs/data_)

Path Regex

  • use regexp to match file paths. If a file path doesn’t match the specified pattern, the file is skipped. For example, if you specify the pattern .csv$ # , then a file is skipped if its path doesn’t match the pattern. Read more about regular expressions.

Skip Glacier Objects

  • select to skip processing objects stored in the Amazon Glacier storage class. If objects are stored in Glacier storage class, but this option is not checked, an exception is thrown.

Filter by Modified Time

  • choose how to filter files for ingestion:

If it is unchecked (default):

  • Start after path: inserts last_path parameter so that the first execution skips files before the path. (Ex. logs/data_20170101.csv)

  • Incremental: enables incremental loading. If incremental loading is enabled, config diff for the next execution includes the last_path parameter so that the next execution skips files before the path. Otherwise, last_path is not included.

If it is checked:

  • Modified after: inserts last_modified_time parameters so that first execution skips files that were modified before that specified timestamp (Ex. 2019-06-03T10:30:19.806Z)

  • Incremental by Modified Time: enables incremental loading by modified time. If incremental loading is enabled, config diff for the next execution includes the last_modified_time parameter so that the next execution skips files that were modified before that time. Otherwise, last_modified_time is not included.

You can limit access to your S3 bucket/IAM user by using a list of static IPs. Contact support@treasuredata.com if you need static IPs.

There are instances where you might need to scan all the files in a directory (such as from the top-level directory "/"). In such instances, you must use the CLI to do the import.

Example

Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can save CloudFront log files, shown as follows:

[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_bucket] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]

In this case, the Source Table settings are as shown:

  • Bucket: your_bucket

  • Path Prefix: logging/

  • Path Regex: .gz$ (Not Required)

  • Start after path: logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz (Assuming that you want to import the log files from 2017-04-23-16.)

  • Incremental: true (if you want to schedule this job.)

BZip2 decoder plugin is supported as default. Zip Decoder Function

Data Settings

  1. Select Next.
    The Data Settings page opens.

  2. Optionally, edit the data settings or skip this page of the dialog.

Filters


Import Integration Filters enable you to modify your imported data after you have completed Editing Data Settings for your import.

To apply import integration filters:

Select Next in Data Settings.

The Filters dialog opens.

Select the filter option you want to add.


Select Add Filter.

The parameter dialog for that filter opens.

Edit the parameters.

For information on each filter type, see one of the following:
Retaining Columns Filter
Adding Columns Filter
Dropping Columns Filter
Expanding JSON Filter
Digesting Filter

Optionally, to add another filter of the same type, select Add within the specific column filter dialog.
Optionally, to add another filter of a different type, select the filter option from the list and repeat the same steps.
After you have added the filters you want, select Next.
The Data Preview dialog opens.

Data Preview


You can see a preview of your data before running the import by selecting Generate Preview.

Data shown in the data preview is approximated from your source. It is not the actual data that is imported.

  1. Select Next.
    Data preview is optional and you can safely skip to the next page of the dialog if you want.

  2. To preview your data, select Generate Preview. Optionally, select Next

  3. Verify that the data looks approximately like you expect it to.


  4. Select Next.



Data Placement

For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

  1.  Select Next. Under Storage you will create a new or select an existing database and create a new or select an existing table for where you want to place the imported data.

  2. Select a Database > Select an existing or Create New Database.

  3. Optionally, type a database name.

  4. Select a Table> Select an existing or Create New Table.

  5. Optionally, type a table name.

  6. Choose the method for importing the data.

    • Append (default)-Data import results are appended to the table.
      If the table does not exist, it will be created.

    • Always Replace-Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created. 

    • Replace on New Data-Only replace the entire content of an existing table with the result output when there is new data.

  7. Select the Timestamp-based Partition Key column.
    If you want to set a different partition key seed than the default key, you can specify the long or timestamp column as the partitioning time. As a default time column, it uses upload_time with the add_time filter.

  8. Select the Timezone for your data storage.

  9. Under Schedule, you can choose when and how often you want to run this query.

    • Run once:
      1. Select Off.

      2. Select Scheduling Timezone.

      3. Select Create & Run Now.

    • Repeat the query:

      1. Select On.

      2. Select the Schedule. The UI provides these four options: @hourly, @daily and @monthly or custom cron.

      3. You can also select Delay Transfer and add a delay of execution time.

      4. Select Scheduling Timezone.

      5. Select Create & Run Now.

 After your transfer has run, you can see the results of your transfer in Data Workbench > Databases.

Validating Your Data Connector Jobs

How do I troubleshoot data import problems?

Review the job log. Warning and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.

To find out more about a specific job, you can select that job and see details. Depending on the type of job, you can see some or all of the following: results, query, output logs, engine logs, details, and destination.

  1. Open the TD Console.

  2. Navigate to Jobs. You can review the number of jobs which is listed in the upper right of the page.


  3. Optionally, use filters to reduce the listing of jobs to locate what you are interested in. Including filtering by job owner, date, and database name.

  4. Select a job to open it and view results, query definition, logs, and other details.


  5. Each tab has different information about the job.



Results

  • View the imported data from the job.

  • From here you can copy the results to the clipboard or download them as a CSV file.

Query

  • View the query syntax of the job

  • Launch a query editor

  • Copy queries and use to create new queries or workflows

  • Refine queries to improve efficiency

Output and Engine Logs

  • Log information can be reviewed for run times, query result numbers, and error codes

  • Log information can be copied to the clipboard

Details

View further details:

  • query name

  • type

  • job id

  • status

  • duration

  • scheduled and actual times

  • result count and size

  • runner,

  • database queried

  • priority

Destination

Here you can view details of an export integration configuration:

  • integration

  • type

  • settings

What can I do if the data connector for S3 job is running for a long time?

Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, the performance degrades. To mitigate this issue, you can:

  • Narrow path_prefix option and reduce the count of S3 files.

  • Set 268,435,456 (256MB) to min_task_size option.

Sample Workflow

There is a sample workflow file for S3 import integration. You can define the import settings using yml file, and run it using `td_load>:` workflow operator. Variable definitions that cannot be used with the Source function of the TD console alone are possible with yml file-based execution.

You can refer the sample code from https://github.com/treasure-data/treasure-boxes/tree/master/td_load/s3.

timezone: UTC

schedule:
  daily>: 02:00:00

sla:
  time: 08:00
  +notice:
    mail>: {data: Treasure Workflow Notification}
    subject: This workflow is taking long time to finish
    to: [me@example.com]

_export:
  td:
    dest_db: dest_db_ganesh
    dest_table: dest_table_ganesh

+prepare_table:
  td_ddl>:
  create_databases: ["${td.dest_db}"]
  create_tables: ["${td.dest_table}"]
  database: ${td.dest_db}

+load:
  td_load>: config/daily_load.yml
  database: ${td.dest_db}
  table: ${td.dest_table}
  • No labels