Amazon S3 Parquet Export Integration

Parquet is a columnar data format that is particularly beneficial for analytics workloads, where queries often need to access only a subset of the columns in a dataset, making it a popular choice in many data pipelines.

The Amazon S3 Parquet export integration lets you generate Parquet files from Treasure Data job results and upload them directly to Amazon S3.

Prerequisites

  • Basic knowledge of Treasure Data.

S3 Bucket Policy Configuration

If your AWS S3 bucket is located in the same region as your TD region, the IP address from which TD accesses the bucket is private and changes dynamically. If you want to restrict access, specify the VPC ID instead of static IP addresses. For example, in the US region, configure access through vpc-df7066ba; in the Tokyo region, through vpc-e630c182; and in the EU01 region, through vpc-f54e6a9e.

Determine your TD region from the URL you use to log in to the TD Console, and use the VPC ID that corresponds to that region.

See the API Documentation for details.
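
If you restrict access with a bucket policy, one option is a Deny statement keyed on the aws:SourceVpc condition, which is populated when requests reach S3 through a VPC endpoint. The policy below is a minimal sketch only, assuming a bucket named your_bucket_name and the US-region VPC ID listed above; note that a blanket Deny like this also blocks your own access from outside that VPC, so adjust the Principal, Action, and VPC ID to your own requirements.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessFromOutsideTdVpc",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::your_bucket_name",
        "arn:aws:s3:::your_bucket_name/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-df7066ba"
        }
      }
    }
  ]
}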

Static IP Address of Treasure Data Integration

If your security policy requires IP whitelisting, you must add Treasure Data's IP addresses to your allowlist to ensure a successful connection.

Please find the complete list of static IP addresses, organized by region, at the following link:
https://api-docs.treasuredata.com/en/overview/ip-addresses-integrations-result-workers/

Requirements and Limitations

Create a connection on the TD Console

You must create and configure the data connection before running your query. As part of the data connection, you provide authentication to access the integration.

Supported Authentication methods for Amazon S3 Parquet

  • basic
  • session
  • assume_role

Create Authentication

Your first step is to create a new authentication with a set of credentials.

  1. Select Integrations Hub.
  2. Select Catalog.
  3. Search for your Integration in the Catalog; hover your mouse over the icon and select Create Authentication.
  4. Ensure that the Credentials tab is selected, then enter the credential information for the integration. The authentication fields are:

  • Endpoint: S3 service endpoint override. You can find region and endpoint information in AWS service endpoints (e.g., s3.ap-northeast-1.amazonaws.com). When specified, it overrides the Region setting.
  • Region: AWS region.
  • Authentication Method: one of the following.
    • basic: uses access_key_id and secret_access_key to authenticate. See AWS Programmatic access. Requires Access Key ID and Secret Access Key.
    • session (recommended): uses a temporarily generated access_key_id, secret_access_key, and session_token. Requires Access Key ID, Secret Access Key, and Session Token.
    • assume_role: uses role-based access. See AWS AssumeRole. Requires TD's Instance Profile, Account ID, Your Role Name, External ID, and Duration In Seconds.
    • anonymous: not supported.
  • Access Key ID: issued by AWS S3.
  • Secret Access Key: issued by AWS S3.
  • Session Token: your temporary AWS session token.
  • TD's Instance Profile: provided by the TD Console. The numeric portion of this value is the Account ID that you use when you create your IAM role.
  • Account ID: your AWS account ID.
  • Your Role Name: your AWS role name.
  • External ID: your secret external ID.
  • Duration: duration of the temporary credentials.
  5. Select Continue.
  6. Enter a name for your authentication, and select Done.

Create an Authentication with the assume_role authentication method

  1. Create a new authentication with the assume_role authentication method.
  2. Make a note of the numeric portion of the value in the TD's Instance Profile field.
  3. Create your AWS IAM role.
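
The IAM role you create must allow the account shown in TD's Instance Profile to assume it, and should require the External ID you entered in the authentication. The trust policy below is a minimal sketch, where 123456789012 stands in for the numeric portion of TD's Instance Profile and your_external_id for your External ID; the role also needs S3 write permissions (for example, s3:PutObject on the destination bucket) for the export to succeed.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "your_external_id"
        }
      }
    }
  ]
}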

Configure Query Result Export

Define your Query

  1. Navigate to Data Workbench > Queries.
  2. Select New Query.
  3. Run the query to validate the result set.

Specify the Result Export Target

  1. Select Export Results.
  2. Specify an Authentication, and start configuring the export. You can select an existing authentication or create a new authentication for the external service to be used for output. Choose one of the following:

  • Use an existing Integration
  • Create a new Integration


Configuration fields

  • Server-side encryption: supported values are sse-s3 (server-side encryption with S3-managed keys) and sse-kms (server-side encryption with AWS KMS keys).
  • Server-side Encryption Algorithm: supported value is AES256.
  • KMS Key ID: symmetric AWS KMS key ID. If no KMS Key ID is provided, the default KMS key is created or used.
  • Bucket: the S3 bucket name (e.g., your_bucket_name).
  • Path: the S3 file name (object key), including an extension (e.g., test.parquet).
  • Compression: the compression format of the exported files. Supported values: None, Gzip, Snappy.
  • Row group size: the row group size of the Parquet file.
  • Page size: the page size of the Parquet file.
  • Timestamp Unit: the timestamp unit for the Parquet file. Supported values: Milliseconds, Microseconds.
  • Single file: uncheck to split the output into multiple files, each the size of one row group.
  • Enable Bloom filter: enable Bloom filters for the Parquet file.
  • Retry limit: maximum number of retries.
  • Retry wait: wait time in milliseconds between retries.
  • Number of concurrent threads: number of concurrent threads used to upload to S3.
  • Connections per thread: number of HTTP connections to S3 opened per thread.
  • Part size: part size for S3 multipart uploads.
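
These console fields correspond to keys in the export settings shown later in the Workflow and CLI sections. As a rough illustration of how a typical configuration maps (the bucket, path, and KMS key ARN below are placeholders):

{
  "bucket": "your_bucket_name",
  "path": "test.parquet",
  "compression": "snappy",
  "sse_type": "sse-kms",
  "kms_key_id": "arn:aws:kms:us-east-1:111122223333:key/your-key-id",
  "row_group_size": 256,
  "page_size": 1024,
  "timestamp_unit": "milliseconds",
  "enable_single_file": true
}

If kms_key_id is omitted while sse_type is sse-kms, the default KMS key is used, as noted above.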

Create an Activation on Audience Studio

You can create an activation to export segment data or stages.

(Optional) Schedule Query Export Jobs

You can use Scheduled Jobs with Result Export to periodically write the output result to a target destination that you specify.

Treasure Data's scheduler feature supports periodic query execution to achieve high availability.

When the schedule contains conflicting field specifications, the specification that requests more frequent execution is followed and the other is ignored.

For example, with the cron schedule '0 0 1 * 1', the 'day of month' and 'day of week' fields conflict: the former requests a run at midnight (00:00) on the first day of each month, while the latter requests a run at midnight (00:00) every Monday. Because the weekly schedule runs more often, the 'day of week' specification is followed.

Scheduling your Job Using TD Console

  1. Navigate to Data Workbench > Queries

  2. Create a new query or select an existing query.

  3. Next to Schedule, select None.

  4. In the drop-down, select one of the following schedule options:

    • Custom cron...: review the Custom cron... details below.
    • @daily (midnight): run once a day at midnight (00:00) in the specified time zone.
    • @hourly (:00): run every hour at minute 00.
    • None: no schedule.

Custom cron... Details

  • 0 * * * *: run once an hour.
  • 0 0 * * *: run once a day at midnight.
  • 0 0 1 * *: run once a month at midnight on the first day of the month.
  • "" (empty): create a job that has no scheduled run time.
 *    *    *    *    *
 -    -    -    -    -
 |    |    |    |    |
 |    |    |    |    +----- day of week (0 - 6) (Sunday=0)
 |    |    |    +---------- month (1 - 12)
 |    |    +--------------- day of month (1 - 31)
 |    +-------------------- hour (0 - 23)
 +------------------------- min (0 - 59)

The following named entries can be used:

  • Day of Week: sun, mon, tue, wed, thu, fri, sat.
  • Month: jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec.

A single space is required between each field. The values for each field can be composed of:

  • A single value, within the limits displayed above for each field.
  • A wildcard '*' to indicate no restriction based on the field. Example: '0 0 1 * *' runs at midnight (00:00) on the first day of each month.
  • A range '2-5', indicating the range of accepted values for the field. Example: '0 0 1-10 * *' runs at midnight (00:00) on each of the first 10 days of each month.
  • A list of comma-separated values '2,3,4,5', indicating the accepted values for the field. Example: '0 0 1,11,21 * *' runs at midnight (00:00) on the 1st, 11th, and 21st day of each month.
  • A periodicity indicator '*/5', expressing how often the schedule runs within the field's valid range of values. Example: '30 */2 1 * *' runs on the 1st of every month, every 2 hours starting at 00:30; '0 0 */5 * *' runs at midnight (00:00) every 5 days starting on the 5th of each month.
  • A comma-separated list of any of the above except the '*' wildcard, such as '2,*/5,8-10'. Example: '0 0 5,*/10,25 * *' runs at midnight (00:00) on the 5th, 10th, 20th, and 25th day of each month.
  5. (Optional) You can delay the start time of a query by enabling Delay execution.

(Optional) Configure Export Results in Workflow

Within Treasure Workflow, you can use this integration to export query results as part of a workflow.

_export:
  td:
    database: td.database

+a_s3_parquet_export_task:
  td>: export_test.sql
  database: ${td.database}
  result_connection: s3_parquet_conn
  result_settings:
    type: s3_parquet

(Optional) Export Integration Using the CLI

To output the result of a single query to an S3 bucket, add the --result option to the td query command. After the job finishes, the results are written to your S3 bucket. For details about the td query command, refer to the TD CLI documentation.

You can specify detailed export settings via the --result parameter as follows:

{
  "type": "s3_parquet",
  "endpoint": "",
  "region": "us-east-1",
  "auth_method": "",
  "session_token": "",
  "account_id": "",
  "role_name": "",
  "duration_in_seconds": 3600,
  "access_key_id": "xxxxx",
  "secret_access_key": "xxxxx",
  "sse_type": "sse-s3",
  "sse_algorithm": "AES256",
  "kms_key_id": "arn:aws:kms:us-east-1:xxx:key/xxx",
  "bucket": "bucket",
  "path": "file.parquet",
  "row_group_size": 256,
  "page_size": 1024,
  "timestamp_unit": "milliseconds",
  "enable_bloom_filter": false,
  "enable_single_file": true,
  "compression": "gzip",
  "part_size": 100,
  "retry_limit": 7,
  "retry_wait_millis": 500,
  "number_of_concurrent_threads": 4,
  "connections_per_thread": 128,
}

Parameters

  • type (required): the destination service name. Value: s3_parquet.
  • endpoint (optional): S3 service endpoint override. You can find region and endpoint information in AWS service endpoints (e.g., s3.ap-northeast-1.amazonaws.com).
  • region (optional): AWS region. Default: us-east-1.
  • auth_method (required): authentication method for S3: basic, session, or assume_role. Default: basic.
  • access_key_id: access key ID. Required when auth_method is basic or session.
  • secret_access_key: secret access key. Required when auth_method is basic or session.
  • session_token: session token. Required when auth_method is session.
  • account_id: account ID. Required when auth_method is assume_role.
  • role_name: role name. Required when auth_method is assume_role.
  • external_id: external ID. Required when auth_method is assume_role.
  • duration_in_seconds (optional): duration of the connection, in seconds. Default: 3600.
  • bucket (required): S3 bucket name.
  • path (required): S3 path including the file name and extension (e.g., "/path/file.parquet").
  • row_group_size (optional): Parquet row group size. Default: 256.
  • page_size (optional): Parquet page size. Default: 1024.
  • timestamp_unit (optional): Parquet timestamp unit: milliseconds or microseconds. Default: milliseconds.
  • enable_bloom_filter (optional): enable Bloom filters for the Parquet file: true or false. Default: false.
  • enable_single_file: export to a single file (true) or to multiple files with one row group per file (false). Default: true.
  • compression (optional): compression format of the exported files: None, Gzip, or Snappy. Default: None.
  • retry_limit (optional): maximum number of retries. Default: 7.
  • retry_wait_millis (optional): wait time in milliseconds between retries. Default: 500.
  • number_of_concurrent_threads (optional): number of concurrent threads used to upload to S3. Default: 4.
  • connection_per_thread (optional): number of HTTP connections to S3 opened per thread. Default: 16.
  • part_size (optional): part size for S3 multipart uploads. Default: 100.

Example Usage

$ td query \
--result '{"type":"s3_parquet","auth_method":"basic","access_key_id":"access_key_id","secret_access_key":"secret_access_key","bucket":"bucket","path":"/path/file.parquet"}' \
-d sample_datasets "select * from www_access" -T presto
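
For the assume_role authentication method, the --result settings would look similar to the following sketch; the account ID, role name, and external ID shown are placeholders:

{
  "type": "s3_parquet",
  "auth_method": "assume_role",
  "account_id": "123456789012",
  "role_name": "your_role_name",
  "external_id": "your_external_id",
  "duration_in_seconds": 3600,
  "bucket": "bucket",
  "path": "/path/file.parquet"
}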