# Amazon S3 Parquet Export Integration

Parquet is a columnar data format that is particularly beneficial for analytics workloads, where queries often need to access only a subset of the columns in a dataset, which makes it a popular choice in many data pipeline strategies. The Amazon S3 Parquet export integration lets you generate Parquet files from Treasure Data job results and upload them directly to Amazon S3.

## Prerequisites

- Basic knowledge of Treasure Data.

## S3 Bucket Policy Configuration

If your AWS S3 bucket is located in the same region as your TD region, the IP address from which TD accesses the bucket is private and changes dynamically. If you want to restrict access, specify the VPC ID instead of static IP addresses. For example, in the US region, configure access through vpc-df7066ba; in the Tokyo region, through vpc-e630c182; and in the EU01 region, through vpc-f54e6a9e. You can determine your TD region from the URL you use to log in to TD Console, and then refer to the data connector for that region. See the [API Documentation](https://api-docs.treasuredata.com/en/overview/ip-addresses-integrations-result-workers/#s3-bucket-policy-configuration-for-export-and-import-integrations) for details.
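As an illustration only, the sketch below denies object uploads to the bucket when the request does not arrive through the US-region TD VPC (vpc-df7066ba). The bucket name `your_bucket_name` is a placeholder, and the exact statements you need depend on your security requirements, so treat this as a starting point and follow the API Documentation linked above for the authoritative policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUploadsFromOutsideTdVpc",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your_bucket_name/*",
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-df7066ba"
        }
      }
    }
  ]
}
```

Replace the VPC ID with the one for your TD region, and keep (or add) statements that allow your own principals to manage the bucket.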
## Static IP Address of Treasure Data Integration

If your security policy requires IP whitelisting, you must add Treasure Data's IP addresses to your allowlist to ensure a successful connection. The complete list of static IP addresses, organized by region, is available at [https://api-docs.treasuredata.com/en/overview/ip-addresses-integrations-result-workers/](https://api-docs.treasuredata.com/en/overview/ip-addresses-integrations-result-workers/).

## Requirements and Limitations

- The maximum supported row group size is 512 MB.
- Because files are uploaded with multipart upload, incomplete upload parts may remain in the S3 bucket. It is recommended to enable a clean-up policy for incomplete multipart uploads (see the sketch after this list). For details, see [S3 Lifecycle Management Update – Support for Multipart Uploads and Delete Markers](https://aws.amazon.com/blogs/aws/s3-lifecycle-management-update-support-for-multipart-uploads-and-delete-markers/).
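As an illustration of such a clean-up policy, the lifecycle configuration below aborts incomplete multipart uploads after 7 days. The rule ID and the 7-day threshold are example values; you can apply a configuration like this with the AWS CLI (`aws s3api put-bucket-lifecycle-configuration`) or in the S3 console.

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```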
# Create a connection on the TD Console

You must create and configure the data connection before running your query. As part of the data connection, you provide authentication to access the integration.

## Supported Authentication Methods for Amazon S3 Parquet

| Authentication Method | Amazon S3 Parquet |
| --- | --- |
| **basic** | **x** |
| **session** | **x** |
| **assume_role** | **x** |

## Create Authentication

Your first step is to create a new authentication with a set of credentials.

1. Select **Integrations Hub**.
2. Select **Catalog**.
![](/assets/26617500.ca92fa4ab9277dca95973c6bd413fc662a3f0d04b57d58f7a8c952a29f28bbec.6bd3f7a7.png)
3. Search for your integration in the Catalog, hover your mouse over the icon, and select **Create Authentication**.
![](/assets/26617501.d6a5ad453f9fa98b4336af9506dadbd04ea44299e2f9af7fd760f75ac68e3f88.25ec5a77.png)
4. Ensure that the **Credentials** tab is selected and then enter credential information for the integration.

**The Authentication fields**

| Parameter | Description |
| --- | --- |
| **Endpoint** | S3 service endpoint override. You can find region and endpoint information in [AWS service endpoints](http://docs.aws.amazon.com/general/latest/gr/rande.md#s3_region). (Ex. [*s3.ap-northeast-1.amazonaws.com*](https://s3.ap-northeast-1.amazonaws.com/)) When specified, it overrides the region setting. |
| **Region** | AWS Region |
| **Authentication Method** | |
| **basic** | Uses access_key_id and secret_access_key to authenticate. See [AWS Programmatic access](https://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.md). Fields: Access Key ID, Secret Access Key |
| **session (Recommended)** | Uses a temporarily generated access_key_id, secret_access_key, and session_token. Fields: Access Key ID, Secret Access Key, Session Token |
| **assume_role** | Uses role access. See [AWS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.md). Fields: TD's Instance Profile, Account ID, Your Role Name, External ID, Duration In Seconds |
| **anonymous** | Not supported |
| **Access Key ID** | AWS S3 issued |
| **Secret Access Key** | AWS S3 issued |
| **Session Token** | Your temporary AWS Session Token |
| **TD's Instance Profile** | This value is provided by the TD Console. The numeric portion of the value is the Account ID that you use when you create your IAM role. |
| **Account ID** | Your AWS Account ID |
| **Your Role Name** | Your AWS Role Name |
| **External ID** | Your secret External ID |
| **Duration** | Duration of the temporary credentials |

5. Select **Continue**.
6. Enter a name for your authentication, and select **Done**.

## Create an Authentication with the *assume_role* Authentication Method

1. Create a new authentication with the *assume_role* authentication method.
2. Make a note of the numeric portion of the value in the TD's Instance Profile field.
![](/assets/4f8559a3-4f70-4c83-b8b9-adaaec64662f_1_201_a.1331328bd097ab94743fc3d5f53d63f8cbc98ad689475b3adc5b21335a2e22e3.e7823387.jpeg)
3. Create your AWS IAM role (a sketch of the role's trust policy follows these steps).
![](/assets/ea9e1f37-be45-4ac6-97fa-4b46f65b0388_1_201_a.99af6c7efd4bcd9e50a04d5457e1febf3da2f104b7a090d9bcc307dc3dc8d8c3.e7823387.jpeg)
![](/assets/45c6048c-cc9e-40c1-b0f4-529e356d6e16_1_201_a.bb869c9c9ad66cdecbebc9b2b1231acd5f24c98517c6bc2cfb6ec67efcd884ef.e7823387.jpeg)
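As a reference for step 3, the following is a minimal sketch of the IAM role's trust policy. The account ID `123456789012` stands in for the numeric portion of TD's Instance Profile that you noted in step 2, and `your_external_id` stands in for the External ID used in the authentication; confirm the exact principal and conditions against the values displayed in the TD Console.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "your_external_id"
        }
      }
    }
  ]
}
```

You also need to attach a permissions policy to the role that grants write access (for example, s3:PutObject) on the target bucket.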
## Configure a Query Result Export

### Define your Query

1. Navigate to **Data Workbench > Queries**.
2. Select **New Query**.
3. Run the query to validate the result set.
![](/assets/export-integration-template_26617476-2024-02-05.7a13029cf9bb61c96f1ef577583007e067c336b7d0cabc7a9222402f86046b29.9f8ceb9f.png)

### Specify the Result Export Target

1. Select **Export Results**.
![](/assets/export-integration-template_26617476-2024-02-05-1.5b83da44688e6582943f259dce9d18f20176855b764e46205a8d99cc64749f96.9f8ceb9f.png)
2. Specify an authentication and start configuring the export. You can select an existing authentication or create a new authentication for the external service to be used for output. Choose one of the following:

**Use an existing Integration**
![](/assets/export-integration-template_26617476-2024-02-05-2.654362d5d36805519ec64f0c0dcb467e4921dcd3e2760a0f6f6d4b2ddb672a3a.9f8ceb9f.png)

**Create a new Integration**
![](/assets/amazon-s3-parquet-export-integration-2024-11-05.a702f34211c01d44256ed868e8c6024c0c4a1a3fc127da746df23a359fb518a1.e18fb3ca.png)
![](/assets/s3-parquet-export-integration-2024-05-23-2.0f3e2a3772669df4a2b3e2e03558df8b7d140d35dddebdcf3c01b227453f21f7.2a336cf9.png)

### Configuration fields

| Field | Description |
| --- | --- |
| Server-side encryption | Supported values: sse-s3 (server-side encryption with Amazon S3 managed keys), sse-kms (server-side encryption with AWS KMS keys) |
| Server-side Encryption Algorithm | Supported value: AES256 |
| KMS Key ID | Symmetric AWS KMS Key ID. If no KMS Key ID is provided, the default KMS key is created/used. |
| Bucket | Provide the S3 bucket name (Ex. your_bucket_name). |
| Path | Specify the S3 file name (object key), including an extension (Ex. test.parquet). |
| Compression | The compression format of the exported files. Supported values: None, Gzip, Snappy |
| Row group size | Specify the row group size of the Parquet file. |
| Page size | Specify the page size of the Parquet file. |
| Timestamp Unit | Specify the timestamp unit for the Parquet file. Supported time units: Milliseconds, Microseconds |
| Single file | Uncheck to split the output into multiple files, each containing one row group. |
| Enable Bloom filter | Enable the Bloom filter for the Parquet file. |
| Retry limit | Maximum number of retries |
| Retry wait | Waiting time in ms between retries |
| Number of concurrent threads | Number of concurrent threads used to upload to S3 |
| Connections per thread | Number of HTTP connections per thread open to S3 |
| Part size | Part size of the multipart upload to S3 |

# Create an Activation on Audience Studio

You can [create an activation](https://docs.treasuredata.com/smart/project-product-documentation/create-an-activation) to export segment data or stages.

### (Optional) Schedule Query Export Jobs

You can use scheduled jobs with Result Export to periodically write the output result to a target destination that you specify.

Treasure Data's scheduler feature supports periodic query execution to achieve high availability. When a cron expression contains conflicting schedule specifications, the specification requesting more frequent execution is followed and the other is ignored. For example, if the cron schedule is `'0 0 1 * 1'`, the 'day of month' and 'day of week' specifications conflict: the former requires the job to run on the first day of each month at midnight (00:00), while the latter requires it to run every Monday at midnight (00:00). The latter specification is followed.

#### Scheduling your Job Using TD Console

1. Navigate to **Data Workbench > Queries**.
2. Create a new query or select an existing query.
3. Next to **Schedule**, select None.
![](/assets/image2021-1-15_17-28-51.f1b242f6ecc7666a0097fdf37edd1682786ec11ef80eff68c66f091bc405c371.0f87d8d4.png)
4. In the drop-down, select one of the following schedule options:
![](/assets/image2021-1-15_17-29-47.45289a1c99256f125f4d887e501e204ed61f02223fde0927af5f425a89ace0c0.0f87d8d4.png)

| Drop-down Value | Description |
| --- | --- |
| Custom cron... | Review [Custom cron... details](#custom-cron-details). |
| @daily (midnight) | Run once a day at midnight (00:00 am) in the specified time zone. |
| @hourly (:00) | Run every hour at 00 minutes. |
| None | No schedule. |

#### Custom cron... Details

![](/assets/image2021-1-15_17-30-23.0f94a8aa5f75ea03e3fec0c25b0640cd59ee48d1804a83701e5f2372deae466c.0f87d8d4.png)

| **Cron Value** | **Description** |
| --- | --- |
| `0 * * * *` | Run once an hour. |
| `0 0 * * *` | Run once a day at midnight. |
| `0 0 1 * *` | Run once a month at midnight on the morning of the first day of the month. |
| `""` | Create a job that has no scheduled run time. |

```
*    *    *    *    *
-    -    -    -    -
|    |    |    |    |
|    |    |    |    +----- day of week (0 - 6) (Sunday=0)
|    |    |    +---------- month (1 - 12)
|    |    +--------------- day of month (1 - 31)
|    +-------------------- hour (0 - 23)
+------------------------- min (0 - 59)
```

The following named entries can be used:

- Day of week: sun, mon, tue, wed, thu, fri, sat.
- Month: jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec.

A single space is required between each field. The values for each field can be composed of:

| Field Value | Example | Example Description |
| --- | --- | --- |
| A single value, within the limits displayed above for each field. | | |
| A wildcard `'*'` to indicate no restriction based on the field. | `'0 0 1 * *'` | Configures the schedule to run at midnight (00:00) on the first day of each month. |
| A range `'2-5'`, indicating the range of accepted values for the field. | `'0 0 1-10 * *'` | Configures the schedule to run at midnight (00:00) on the first 10 days of each month. |
| A list of comma-separated values `'2,3,4,5'`, indicating the list of accepted values for the field. | `'0 0 1,11,21 * *'` | Configures the schedule to run at midnight (00:00) on the 1st, 11th, and 21st day of each month. |
| A periodicity indicator `'*/5'` to express how often, based on the field's valid range of values, a schedule is allowed to run. | `'30 */2 1 * *'` | Configures the schedule to run on the 1st of every month, every 2 hours starting at 00:30. `'0 0 */5 * *'` configures the schedule to run at midnight (00:00) every 5 days starting on the 5th of each month. |
| A comma-separated list of any of the above except the `'*'` wildcard is also supported, for example `'2,*/5,8-10'`. | `'0 0 5,*/10,25 * *'` | Configures the schedule to run at midnight (00:00) every 5th, 10th, 20th, and 25th day of each month. |

5. (Optional) You can delay the start time of a query by enabling **Delay execution**.

# (Optional) Configure Export Results in Workflow

Within Treasure Workflow, you can specify the use of this integration to export data. The keys under `result_settings` generally correspond to the `--result` parameters described in the CLI section below.

```
_export:
  td:
    database: td.database

+a_s3_parquet_export_task:
  td>: export_test.sql
  database: ${td.database}
  result_connection: s3_parquet_conn
  result_settings:
    type: s3_parquet
    bucket: your_bucket_name   # placeholder bucket name
    path: /path/file.parquet   # placeholder object key
```

# (Optional) Export Integration Using the CLI

To output the result of a single query to an S3 bucket, add the *--result* option to the **td query** command. After the job is finished, the results are written to your S3 bucket. For more about the **td query** command, see [this article](https://docs.treasuredata.com/display/PD/TD+Toolbelt+Job+and+Query+Command+Reference).
You can specify detailed settings for your S3 export via the *--result* parameter as follows:

```json
{
  "type": "s3_parquet",
  "endpoint": "",
  "region": "us-east-1",
  "auth_method": "",
  "session_token": "",
  "account_id": "",
  "role_name": "",
  "duration_in_seconds": 3600,
  "access_key_id": "xxxxx",
  "secret_access_key": "xxxxx",
  "sse_type": "sse-s3",
  "sse_algorithm": "AES256",
  "kms_key_id": "arn:aws:kms:us-east-1:xxx:key/xxx",
  "bucket": "bucket",
  "path": "file.parquet",
  "row_group_size": 256,
  "page_size": 1024,
  "timestamp_unit": "milliseconds",
  "enable_bloom_filter": false,
  "enable_single_file": true,
  "compression": "gzip",
  "part_size": 100,
  "retry_limit": 7,
  "retry_wait_millis": 500,
  "number_of_concurrent_threads": 4,
  "connections_per_thread": 128
}
```

## Parameters

| Name | Description | Value | Default Value | Required |
| --- | --- | --- | --- | --- |
| type | Name of the export destination service. | s3_parquet | N/A | Yes |
| endpoint | S3 service endpoint override. You can find region and endpoint information in AWS service endpoints. (Ex. s3.ap-northeast-1.amazonaws.com) | S3 service endpoint override | N/A | No |
| region | AWS Region | AWS Region | us-east-1 | No |
| auth_method | Authentication method for S3 | basic, session, assume_role | basic | Yes |
| access_key_id | Access Key ID | Access Key ID | N/A | Yes, if auth_method is basic or session |
| secret_access_key | Secret Access Key | Secret Access Key | N/A | Yes, if auth_method is basic or session |
| session_token | Session Token | Session Token | N/A | Yes, if auth_method is session |
| account_id | Account ID | Account ID | N/A | Yes, if auth_method is assume_role |
| role_name | Role Name | Role Name | N/A | Yes, if auth_method is assume_role |
| external_id | External ID | External ID | N/A | Yes, if auth_method is assume_role |
| duration_in_seconds | Duration of the connection | Duration in seconds | 3600 | No |
| bucket | S3 bucket name | S3 bucket name | N/A | Yes |
| path | S3 path with file extension, ex: "/path/file.parquet" | File path | N/A | Yes |
| row_group_size | Parquet row group size | Parquet row group size | 256 | No |
| page_size | Parquet page size | Parquet page size | 1024 | No |
| timestamp_unit | Parquet timestamp unit | milliseconds, microseconds | milliseconds | No |
| enable_bloom_filter | Enable Bloom filter for the Parquet file | true, false | false | No |
| enable_single_file | Export to a single file; when false, export to multiple files with one row group per file | true, false | true | Yes |
| compression | Compression format of the exported file. | None, Gzip, Snappy | None | No |
| retry_limit | Maximum number of retries | Retry limit | 7 | No |
| retry_wait_millis | Waiting time in ms between retries | Retry wait in ms | 500 | No |
| number_of_concurrent_threads | Number of concurrent threads used to upload to S3 | Concurrent S3 threads | 4 | No |
| connection_per_thread | Number of HTTP connections per thread open to S3 | HTTP connections per thread | 16 | No |
| part_size | Part size of the multipart upload to S3 | Part size | 100 | No |

## Example Usage

```bash
$ td query \
  --result '{"type":"s3_parquet","auth_method":"basic","access_key_id":"access_key_id","secret_access_key":"secret_access_key","bucket":"bucket","path":"/path/file.parquet"}' \
  -d sample_datasets "select * from www_access" -T presto
```
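For comparison with the basic example above, the following is a sketch of *--result* settings that uses the assume_role authentication method with SSE-KMS encryption and Snappy compression. The account ID, role name, external ID, KMS key ARN, bucket, and path are placeholders; substitute your own values.

```json
{
  "type": "s3_parquet",
  "region": "us-east-1",
  "auth_method": "assume_role",
  "account_id": "123456789012",
  "role_name": "your_role_name",
  "external_id": "your_external_id",
  "duration_in_seconds": 3600,
  "sse_type": "sse-kms",
  "kms_key_id": "arn:aws:kms:us-east-1:123456789012:key/your-key-id",
  "bucket": "your_bucket_name",
  "path": "/path/file.parquet",
  "compression": "snappy",
  "timestamp_unit": "microseconds"
}
```

Pass this JSON as a single-quoted string to the *--result* option of **td query**, exactly as in the example above.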