# Amazon S3 Parquet Import Integration

The import data connector for Amazon S3 enables you to import the data from Parquet files stored in S3 buckets.

## About Authentication Methods for Amazon S3 Parquet

| Authentication Method | Amazon S3 parquet |
|  --- | --- |
| **basic** | **x** |
| **session** | **x** |
| **assume_role** | **x** |


**Prerequisites**

- Basic knowledge of Treasure Data.


## S3 Bucket Policy Configuration

If you are using an AWS S3 bucket located in the same region as your TD region, the IP address from which TD accesses the bucket is private and changes dynamically. To restrict access, specify the VPC ID instead of static IP addresses. For example, in the US region, configure access through vpc-df7066ba; in the Tokyo region, through vpc-e630c182; and in the EU01 region, through vpc-f54e6a9e.

Determine the region of your TD Console from the URL you use to log in to TD, then refer to the data connector for that region.

See the [API Documentation](/apis/endpoints/ip-addresses-integrations-result-workers#s3-bucket-policy-configuration-for-export-and-import-integrations) for details.
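
As a rough illustration of restricting access by VPC ID rather than by IP address, the following Python sketch builds a bucket policy that uses the AWS `aws:SourceVpc` condition key. The bucket name and the overall policy shape are assumptions made for this example, not a policy published by Treasure Data; check the linked API documentation before applying anything similar.

```python
import json

# VPC IDs listed in this section, keyed by Treasure Data region.
TD_VPC_IDS = {
    "US": "vpc-df7066ba",
    "Tokyo": "vpc-e630c182",
    "EU01": "vpc-f54e6a9e",
}


def build_bucket_policy(bucket: str, td_region: str) -> dict:
    """Deny S3 requests that do not arrive through the TD VPC for the region.

    Caution: a Deny like this blocks every request from outside that VPC,
    including your own access from the AWS console.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AccessFromTreasureDataVpcOnly",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpc": TD_VPC_IDS[td_region]}
                },
            }
        ],
    }


# "your-bucket" is a placeholder bucket name.
policy = build_bucket_policy("your-bucket", "US")
print(json.dumps(policy, indent=2))
```

Printing the dictionary as JSON gives a document that can be reviewed before it is pasted into the bucket policy editor.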

## Static IP Address of Treasure Data Integration

If your security policy requires IP whitelisting, you must add Treasure Data's IP addresses to your allowlist to ensure a successful connection.

Please find the complete list of static IP addresses, organized by region, in the following [document](/apis/endpoints/ip-addresses-integrations-result-workers).

## Limitation

Treasure Data recommends that you limit the size of individual parquet files to no larger than 100 MB for optimal performance. However, this is not a hard requirement, and you can tune your partition size to meet your needs.

## Creating a New Connection on TD Console

You can use TD Console to create your data connector.

1. Open **TD Console**.
2. Navigate to **Integrations Hub** > **Catalog**.
3. Search for **S3 parquet** and select **Amazon S3 parquet**.
4. Select **Create Authentication**.


The New Authentication dialog opens. Depending on the Authentication Method you choose, the dialog may look like one of these screens:

**basic**

![](/assets/screen-shot-2023-03-28-at-14.24.47.f5f2bda6b9db5d192cb911d7d4ec0aca1299fe948d93cf2369056288b991aeab.e7823387.png)

**session**

![](/assets/screen-shot-2023-03-28-at-14.26.21.c0ba3d29ae2f0819045e766b368619f4ada83cfe886b516e6681b4cd40fe6557.e7823387.png)

**assume_role**

![](/assets/screen-shot-2023-03-28-at-14.26.29.f3cd34a591e47b6970307ef270d5c4e0b0dc881229742d5851e7dc8e3e6c1637.e7823387.png)

5. Configure the authentication fields, and then select **Continue**.


| Parameter | Description |
|  --- | --- |
| **Endpoint** | S3 service endpoint override. You can find region and endpoint information in [AWS service endpoints](http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region). (Ex. [*s3.ap-northeast-1.amazonaws.com*](https://s3.ap-northeast-1.amazonaws.com/)) When specified, it will override the region setting. |
| **Region** | AWS Region |
| **Authentication Method** |  |
| **basic** | Uses access_key_id and secret_access_key to authenticate. See [AWS Programmatic access](https://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html). Fields: Access Key ID, Secret access key. |
| **session (Recommended)** | Uses temporarily generated access_key_id, secret_access_key, and session_token. Fields: Access Key ID, Secret access key, Session token. |
| **assume_role** | Uses role access. See [AWS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html). Fields: TD's Instance Profile, Account ID, Your Role Name, External ID, Duration In Seconds. |
| **anonymous** | Not Supported |
| **Access Key ID** | AWS S3 issued |
| **Secret Access Key** | AWS S3 issued |
| **Session Token** | Your temporary AWS Session Token |
| **TD's Instance Profile** | This value is provided by the TD Console. The numeric portion of the value constitutes the Account ID that you will use when you create your IAM role. |
| **Account ID** | Your AWS Account ID |
| **Your Role Name** | Your AWS Role Name |
| **External ID** | Your Secret External ID |
| **Duration In Seconds** | Duration of the temporary credentials, in seconds |


6. Name your new AWS S3 connection, and select **Done**.


## Creating an Authentication with the assume_role authentication method

1. Create a new authentication with the assume_role authentication method.
2. Make a note of the numeric portion of the value in the TD's Instance Profile field.


![](/assets/4f8559a3-4f70-4c83-b8b9-adaaec64662f_1_201_a.1331328bd097ab94743fc3d5f53d63f8cbc98ad689475b3adc5b21335a2e22e3.e7823387.jpeg)

3. Create your AWS IAM role.


![](/assets/ea9e1f37-be45-4ac6-97fa-4b46f65b0388_1_201_a.99af6c7efd4bcd9e50a04d5457e1febf3da2f104b7a090d9bcc307dc3dc8d8c3.e7823387.jpeg)

![](/assets/45c6048c-cc9e-40c1-b0f4-529e356d6e16_1_201_a.bb869c9c9ad66cdecbebc9b2b1231acd5f24c98517c6bc2cfb6ec67efcd884ef.e7823387.jpeg)
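
For orientation, an IAM role that another AWS account assumes is normally created with a trust policy that names the trusted account and requires the external ID. The Python sketch below builds such a policy document. The account ID and external ID are placeholders, and the exact policy Treasure Data expects is an assumption here; the screens above remain the reference for the actual steps.

```python
import json


def build_trust_policy(td_account_id: str, external_id: str) -> dict:
    """Trust policy that lets one account assume the role with an external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{td_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values: use the numeric portion of TD's Instance Profile
# and the External ID entered in the New Authentication dialog.
trust_policy = build_trust_policy("123456789012", "your-external-id")
print(json.dumps(trust_policy, indent=2))
```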

## Transfer Your AWS S3 Data to Treasure Data

After creating the authenticated connection, the Authentications screen displays.

1. Search for the connection you created.
2. Select **New Source**.


![](/assets/s3parquetsources.e500d5384ff42a004cc3018561ba7bdd846b87a65042739db5368b3734d70970.e7823387.png)

### Connection

1. Type a name for your **Source** in the Data Transfer field.
2. Click **Next**.


![](/assets/s3parquetcreatesource.05fa56ff12f8143248d1016d95dd2d9f0f82dd78d66ae53973b53829ee3abff0.e7823387.png)

### Source Table

The Source dialog opens.

1. Edit the following parameters.


![](/assets/s3parquetcreatesource2.841da89442e5f2253681b4dc42966d496e591379a291cbbd3013108cba6a51c5.e7823387.png)

| **Parameters** | **Description** |
|  --- | --- |
| **Bucket** | The S3 bucket name (e.g., *your_bucket_name*) |
| **Path Prefix** | The prefix for target keys. (e.g. *logs/data_*) |
| **Path Regex** | A regular expression to match file paths. If a file path doesn’t match the specified pattern, the file is skipped. For example, to match only .parquet files, enter the pattern `\.parquet$`; any file whose path doesn't end with .parquet is skipped. See [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions). |
| **Skip Glacier Objects** | Skip processing objects stored in the Amazon Glacier storage class. If objects are stored in the Glacier storage class, but this option is not checked, an exception is thrown. |
| **Start after path** | Only paths that are lexicographically greater than this path are imported. |
| **Sub Folders Are Partitions** | Adds partition columns from Spark partitions. The sub-folder's name must be in this format: partition_column_name=value |


There are instances where you might need to scan all the files in a directory (such as from the top-level directory "/"). In such instances, you must use the CLI to do the import.

**Example**

You can configure your EMR to create Parquet files containing your data, laid out as follows:


```
[your_bucket] - [YM=202010] - [E231A697YXWD39.2020-10-29-15.a103fd5a.parquet]
[your_bucket] - [YM=202010] - [E231A697YXWD39.2020-10-30-15.b2aede4a.parquet]
[your_bucket] - [YM=202010] - [E231A697YXWD39.2020-10-31-01.594fa8e6.parquet]
```

In this case, the Source Table settings are as shown:

- **Bucket**: your_bucket
- **Path Prefix**: YM=202010/
- **Path Regex**: * (Not Required)
- **Start after path**: YM=202010/E231A697YXWD39.2020-10-29-15.a103fd5a.parquet
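
To make the interaction of these settings concrete, here is a minimal Python sketch of how a prefix, a regular expression, and a start-after path could narrow a list of object keys, and how `name=value` sub-folders could become partition columns. It uses the example keys above with a `YM=202010/` prefix. The helper names are hypothetical, and the matching details (regex search on the full key, plain string comparison) are assumptions about the connector's behavior, not a description of its implementation.

```python
import re
from typing import Dict, List, Optional


def select_keys(keys: List[str], path_prefix: str,
                path_regex: Optional[str] = None,
                start_after: Optional[str] = None) -> List[str]:
    """Return the keys that pass the prefix, regex, and start-after checks."""
    pattern = re.compile(path_regex) if path_regex else None
    selected = []
    for key in sorted(keys):
        if not key.startswith(path_prefix):
            continue  # outside the Path Prefix
        if pattern is not None and not pattern.search(key):
            continue  # skipped by Path Regex
        if start_after is not None and key <= start_after:
            continue  # not lexicographically greater than Start after path
        selected.append(key)
    return selected


def partition_columns(key: str) -> Dict[str, str]:
    """Extract name=value partition columns from the folder parts of a key."""
    columns = {}
    for part in key.split("/")[:-1]:  # skip the file name itself
        name, sep, value = part.partition("=")
        if sep and name:
            columns[name] = value
    return columns


keys = [
    "YM=202010/E231A697YXWD39.2020-10-29-15.a103fd5a.parquet",
    "YM=202010/E231A697YXWD39.2020-10-30-15.b2aede4a.parquet",
    "YM=202010/E231A697YXWD39.2020-10-31-01.594fa8e6.parquet",
]
picked = select_keys(keys, "YM=202010/", r"\.parquet$", start_after=keys[0])
print(picked)                      # the 10-30 and 10-31 files
print(partition_columns(keys[1]))  # {'YM': '202010'}
```

Under these assumptions, the first file is excluded because it equals the start-after path, and each remaining file would carry a `YM` partition column.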


### Data Settings

1. Select **Next**.
The Data Settings page opens.
2. Optionally, edit the data settings or skip this page of the dialog.


![](/assets/screen_shot_2023-03-28_at_15_15_08.fe769b33ed2ca687f6bb9b1fbaf3eecebb307a0ac94812682b23463f1842dc3f.e7823387.png)

| **Parameters** | **Description** |
|  --- | --- |
| **Retry Limit** | Maximum number of retries |
| **Initial Retry Interval in Millis** | The initial retry interval in milliseconds. |
| **Max Retry Wait in Millis** | The maximum retry interval. After the initial retry, the wait interval will be doubled until this maximum is reached. |
| **Number of connector threads** | The number of file handles that can be processed in parallel. |
| **Number of threads for S3 file downloads** | The number of connections that can be used to download blocks of a file. |
| **Number of prefetch block** | The number of prefetch blocks to use. |
| **Prefetch block size in MB** | The prefetch block size, specified in MB |
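
The retry settings describe a doubling backoff that is capped at the maximum wait. The Python sketch below computes the resulting wait sequence. The numbers passed in are made-up illustration values, not the connector's defaults, and the function name is hypothetical.

```python
from typing import List


def retry_waits(retry_limit: int, initial_ms: int, max_ms: int) -> List[int]:
    """Return the wait (in ms) before each retry, doubling up to max_ms."""
    waits = []
    wait = initial_ms
    for _ in range(retry_limit):
        waits.append(min(wait, max_ms))  # never wait longer than the cap
        wait *= 2                        # double the interval for the next retry
    return waits


print(retry_waits(retry_limit=5, initial_ms=1000, max_ms=8000))
# [1000, 2000, 4000, 8000, 8000]
```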


### Filters

Filters are available in the Create Source or Edit Source import settings for your S3, FTP, or SFTP connectors.

Import Integration Filters enable you to modify your imported data after you have completed [Editing Data Settings](https://docs.treasuredata.com/smart/project-product-documentation/editing-data-settings) for your import.

To apply import integration filters:

1. Select **Next** in Data Settings. The Filters dialog opens.
2. Select the filter option you want to add.![](/assets/image-20200609-201955.eed6c6da800ba40d1d98b92e767d9a8f7500cad8a9d4079121190b7d34c23294.c7246827.png)
3. Select **Add Filter.** The parameter dialog for that filter opens.
4. Edit the parameters. For information on each filter type, see one of the following:


- Retaining Columns Filter
- Adding Columns Filter
- Dropping Columns Filter
- Expanding JSON Filter
- Digesting Filter


5. Optionally, to add another filter of the same type, select **Add** within the specific column filter dialog.
6. Optionally, to add another filter of a different type, select the filter option from the list and repeat the same steps.
7. After you have added the filters you want, select **Next.** The Data Preview dialog opens.


### Data Preview

You can see a [preview](/products/customer-data-platform/integration-hub/batch/import/previewing-your-source-data) of your data before running the import by selecting Generate Preview. Data preview is optional and you can safely skip to the next page of the dialog if you choose to.

1. Select **Next**. The Data Preview page opens.
2. If you want to preview your data, select **Generate Preview**.
3. Verify the data.


### Data Placement

For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

1. Select **Next.** Under Storage, you will create a new or select an existing database and create a new or select an existing table for where you want to place the imported data.
2. Select a **Database** > **Select an existing** or **Create New Database**.
3. Optionally, type a database name.
4. Select a **Table** > **Select an existing** or **Create New Table**.
5. Optionally, type a table name.
6. Choose the method for importing the data.
   - **Append** (default): Data import results are appended to the table. If the table does not exist, it will be created.
   - **Always Replace**: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
   - **Replace on New Data**: Replaces the entire content of an existing table with the result output only when there is new data.
7. Select the **Timestamp-based Partition Key** column.
If you want to set a different partition key seed than the default key, you can specify the long or timestamp column as the partitioning time. As a default time column, it uses upload_time with the add_time filter.
8. Select the **Timezone** for your data storage.
9. Under **Schedule**, you can choose when and how often you want to run this query.


#### Run once

1. Select **Off**.
2. Select **Scheduling Timezone**.
3. Select **Create & Run Now**.


#### Repeat Regularly

1. Select **On**.
2. Select the **Schedule**. The UI provides these four options: *@hourly*, *@daily*, *@monthly*, or custom *cron*.
3. You can also select **Delay Transfer** and add a delay of execution time.
4. Select **Scheduling Timezone**.
5. Select **Create & Run Now**.


After your transfer has run, you can see the results of your transfer in **Data Workbench** > **Databases.**

## Validating Your Data Connector Jobs

### How do I troubleshoot data import problems?

Review the job log. Warnings and errors provide information about the success of your import. For example, you can [identify the source file names associated with import errors](https://docs.treasuredata.com/smart/project-product-documentation/data-import-error-troubleshooting).

To find out more about a specific job, you can select that job and see details. Depending on the type of job, you can see some or all of the following: results, query, output logs, engine logs, details, and destination.

1. Open the TD Console.
2. Navigate to **Jobs**. The number of jobs is listed in the upper right of the page.


![](/assets/managing_jobs__360001457707__mceclip0.da001c56a934e6cb6001326ffbcf4a0e84c98aeebcb735879d974db1df660a4c.22d00320.png)
3. Optionally, use filters to reduce the listing of jobs to locate what you are interested in, including filtering by job owner, date, and database name.
4. Select a job to open it and view results, query definition, logs, and other details.

![](/assets/managing_jobs__360001457707__mceclip1.8d80bfeb65a6425ba20c1ae0657bc6024a69694a10772e71c54baccf74c5c7f7.22d00320.png)
5. Each tab has different information about the job.

| Tab Name | Description |
|  --- | --- |
| Results | View the imported data from the job. From here you can copy the results to the clipboard or download them as a CSV file. |
| Query | View the query syntax of the job; launch a query editor; copy queries and use them to create new queries or workflows; refine queries to improve efficiency. |
| Output and Engine Logs | Log information can be reviewed for run times, query result numbers, and error codes. Log information can be copied to the clipboard. |
| Details | View further details: query name, type, job id, status, duration, scheduled and actual times, result count and size, runner, database queried, priority. |
| Destination | View details of an export integration configuration (not applicable to an import integration): integration, type, settings. |


## Data type mapping

The Parquet file format uses a minimal set of data types, known as primitive types, which are designed to optimize disk storage efficiency. Logical types extend these by specifying how the primitive types should be interpreted. When imported, these types are converted into data types supported by TD.

| **Parquet data type** | **Type mapping when imported to TD** |
|  --- | --- |
| Primitive types |  |
| boolean | boolean |
| int32 | long |
| int64 | long |
| int96 | string |
| float, double | double |
| byte_array, fixed_len_byte_array | string |
| Logical types |  |
| byte type, short type, integer type, long type | long |
| float type, double type | double |
| decimal | string |
| array type | string |
| map, struct | json |
| string type, binary type | string |


## Importing from AWS S3 Parquet via CLI (Toolbelt)

Optionally, you can use the TD Toolbelt to configure the connection, create the job, and schedule job execution. Ensure that the latest version of the TD Toolbelt is installed before setting up the integration.

### Create Seed Config File (seed.yml)

Configure the *seed.yml* file as shown in the following example with your AWS access keys. You must also specify the bucket name and source file name. Optionally, you can specify path_prefix to match multiple files. In the example below, `path_prefix: path/to/sample_file` will match

- `path/to/sample_file_201501.parquet`
- `path/to/sample_file_201502.parquet`
- `path/to/sample_file_201505.parquet`
- etc.


Using path_prefix with a leading '/' can lead to unintended results. For example, "path_prefix: /path/to/sample_file" would result in the plugin looking for files in s3://sample_bucket//path/to/sample_file, which on S3 is different from the intended path of s3://sample_bucket/path/to/sample_file.


```yaml
in:
  type: s3_parquet
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  # path to the *.parquet file on your s3 bucket
  path_prefix: path/to/sample_file
  path_match_pattern: \.parquet$ # a file will be skipped if its path doesn't match with this pattern

  ## some examples of regexp:
  #path_match_pattern: /archive/ # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/ # match files in .../data1/... or .../data2/... directory
out:
  mode: append
```
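
The effect of a leading slash in path_prefix can be reproduced with plain string handling. The following Python sketch is only an illustration; the helper names are hypothetical and are not part of the TD Toolbelt.

```python
def s3_uri(bucket: str, path_prefix: str) -> str:
    """Join bucket and prefix the way an S3 URI is written."""
    return f"s3://{bucket}/{path_prefix}"


def normalize_prefix(path_prefix: str) -> str:
    """Drop leading slashes so the prefix starts at the bucket root."""
    return path_prefix.lstrip("/")


bad = s3_uri("sample_bucket", "/path/to/sample_file")
good = s3_uri("sample_bucket", normalize_prefix("/path/to/sample_file"))
print(bad)   # s3://sample_bucket//path/to/sample_file
print(good)  # s3://sample_bucket/path/to/sample_file
```

Running a check like this on a prefix before writing it into *seed.yml* is one way to catch the extra slash early.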

If you reuse an existing authentication, set the **td_authentication_id** config key to the Authentication ID. This is required for the assume_role authentication method. See [Reusing the existing Authentication](/int/amazon-s3-parquet-import-integration#h1__756878989).


```yaml
in:
  type: s3_parquet
  td_authentication_id: xxxx
  bucket: sample_bucket
  path_prefix: path/to/sample_file

out:
  mode: append
```

### Guess Fields (Generate load.yml)

*connector:guess* automatically reads the source files and assesses the file format and the fields and columns.


```bash
td connector:guess seed.yml -o load.yml
```

If you look at the load.yml file, you can see the "guessed" file format definitions, including file formats, encodings, column names, and types.


```yaml
in:
  type: s3_parquet
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  bucket: sample_bucket
  path_prefix: path/to/sample_file

out:
  mode: append
```

### Execute Load Job

Submit the load job. It may take a couple of hours, depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.

Treasure Data recommends that you specify the *--time-column* option because Treasure Data’s storage is partitioned by time (see [data partitioning](https://docs.treasuredata.com/smart/project-product-documentation/data-partitioning-in-treasure-data)). If the option is not provided, the data connector chooses the first *long* or *timestamp* column as the partitioning time. The column specified by *--time-column* must be of type *long* or *timestamp*.

If your data doesn’t have a time column, you can add one by using the *add_time* filter option. For more details, see [add_time filter plugin](https://docs.treasuredata.com/smart/project-product-documentation/add_time-filter-function).


```bash
td connector:issue load.yml --database td_sample_db --table td_sample_table \
--time-column created_at
```

In the example above, the connector:issue command assumes that you have already created a *database (td_sample_db)* and a *table (td_sample_table)*. If the database or the table does not exist in Treasure Data, this command will fail. Create the database and table manually, or use the *--auto-create-table* option with the *td connector:issue* command to auto-create the database and table:


```bash
td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table
```

The data connector does not sort records on the server-side. To use time-based partitioning effectively, sort records in files beforehand.

If you have a field called *time*, you don’t have to specify the *--time-column* option.


```bash
td connector:issue load.yml --database td_sample_db --table td_sample_table
```
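
The time-column rules in this section can be summarized in a short Python sketch. The function name and the column list are hypothetical, and the order of the checks (explicit option first, then a field named `time`, then the first *long* or *timestamp* column) is an assumption drawn from the text above rather than the connector's actual code.

```python
from typing import List, Optional, Tuple

TIME_TYPES = {"long", "timestamp"}


def pick_time_column(columns: List[Tuple[str, str]],
                     time_column: Optional[str] = None) -> Optional[str]:
    """Return the column used as partitioning time, or None if there is none."""
    types = dict(columns)
    if time_column is not None:
        # An explicit --time-column must be of type long or timestamp.
        if types.get(time_column) not in TIME_TYPES:
            raise ValueError(f"{time_column} must be of type long or timestamp")
        return time_column
    if "time" in types:
        return "time"
    # Otherwise fall back to the first long or timestamp column.
    for name, col_type in columns:
        if col_type in TIME_TYPES:
            return name
    return None


columns = [("user_id", "long"), ("created_at", "timestamp")]
print(pick_time_column(columns))                # user_id
print(pick_time_column(columns, "created_at"))  # created_at
```

In this made-up schema, the fallback would pick `user_id` because it is the first *long* column, which is one reason to pass *--time-column* explicitly.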

### Setting IAM Permissions

The IAM credentials specified in the YML configuration file, which are used for the *connector:guess* and *connector:issue* commands, need to have permissions for the AWS S3 resources that they need to access. If the IAM user does not have these permissions, configure the user with one of the predefined Policy Definitions or create a new Policy Definition in JSON format.

The following example is based on the [Policy Definition reference format](http://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements.html). It gives the IAM user *read only* permissions (through GetObject and ListBucket actions) to "your-bucket."


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
```

Replace `your-bucket` with the actual name of your S3 bucket.

### Using AWS Security Token Service (STS) as a Temporary Credentials Provider

In certain cases, IAM basic authentication through access_key_id and secret_access_key might be too risky (even though the secret_access_key is never clearly shown when a job is executed or after a session is created).

The S3 data connector can use AWS Security Token Service (STS) to provide [Temporary Security Credentials](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html). Using AWS STS, any IAM user can use their own access_key_id and secret_access_key to create these temporary keys with specific expiration times:

- new_access_key_id
- new_secret_access_key
- session_token


The following are types of Temporary Security Credentials:

- [**Session Token**](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_control-access_getsessiontoken.html)
The simplest Security Credentials with a specified expiration time. The temporary credentials have the same access as the IAM user that generated them. These credentials are valid as long as they have not expired and the permissions of the original IAM user have not changed.
- [**Federation Token**](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_control-access_getfederationtoken.html)
This adds an extra layer of permission control over the Session Token above. When generating a Federation Token, the IAM user is required to specify a Permission Policy definition. The scope can be used to restrict which resources the bearer of the Federation Token can access (which can be less than the access of the IAM user granting the permission). Any [Permission Policy](http://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements.html) definition can be used, but the scope of the permissions is limited to the same, or a subset of, the permissions of the IAM user who generated the token. As with the Session Token, the Federation Token credentials are valid as long as they have not expired and the permissions associated with the original IAM credentials don’t change.


AWS STS Temporary Security Credentials can be generated using the [AWS CLI](https://aws.amazon.com/cli/) or the [AWS SDK](https://aws.amazon.com/tools/) in the language of your choice.

#### Session Token


```bash
$ aws sts get-session-token --duration-seconds 900
```

#### Federation Token

In this example,

- `temp_creds` is the name of the Federated token or the user's temp credentials.
- `bucketname` is the name of the S3 bucket being granted access. (Refer to the [ARN specification](http://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-s3) for more details)
- `s3:GetObject` and `s3:ListBucket` are the basic read operations for an AWS S3 bucket.


```bash
$ aws sts get-federation-token --name temp_creds --duration-seconds 900 \
  --policy '{"Statement": [{"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": ["arn:aws:s3:::bucketname", "arn:aws:s3:::bucketname/*"]}]}'
```

AWS STS credentials cannot be revoked. They will remain effective until expired, or until you delete or remove the permissions of the original IAM user used to generate the credentials.

When your Temporary Security Credentials are generated, include the `SecretAccessKey`, `AccessKeyId`, and `SessionToken` in your *seed.yml* file and execute the Data Connector for S3 as usual.


```yaml
in:
  type: s3_parquet
  auth_method: session
  access_key_id: XXXXXXXXXX
  secret_access_key: YYYYYYYYYY
  session_token: ZZZZZZZZZZ
  bucket: sample_bucket
  path_prefix: path/to/sample_file
```
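
As a small convenience, the mapping from the STS response to the *seed.yml* keys can be sketched in Python. The sample response below is a placeholder in the shape printed by `aws sts get-session-token` (a `Credentials` object with `AccessKeyId`, `SecretAccessKey`, `SessionToken`, and `Expiration`); the function name is hypothetical, and quoting the values is a choice made for this sketch, not a connector requirement.

```python
import json


def seed_in_section(sts_output: str, bucket: str, path_prefix: str) -> str:
    """Render the `in:` section of seed.yml from `aws sts` JSON output."""
    creds = json.loads(sts_output)["Credentials"]
    fields = [
        ("type", "s3_parquet"),
        ("auth_method", "session"),
        ("access_key_id", creds["AccessKeyId"]),
        ("secret_access_key", creds["SecretAccessKey"]),
        ("session_token", creds["SessionToken"]),
        ("bucket", bucket),
        ("path_prefix", path_prefix),
    ]
    # json.dumps emits double-quoted strings, which YAML also accepts.
    body = "\n".join(f"  {key}: {json.dumps(value)}" for key, value in fields)
    return "in:\n" + body + "\n"


# Placeholder credentials, matching the XXXXXXXXXX style used above.
sample = json.dumps({
    "Credentials": {
        "AccessKeyId": "XXXXXXXXXX",
        "SecretAccessKey": "YYYYYYYYYY",
        "SessionToken": "ZZZZZZZZZZ",
        "Expiration": "2024-01-01T12:15:00+00:00",
    }
})
rendered = seed_in_section(sample, "sample_bucket", "path/to/sample_file")
print(rendered)
```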

#### Credential Expiration

Because STS credentials expire after a specified amount of time, the data connector job that uses the credentials might eventually start failing. Currently, if the STS credentials are reported as expired, the data connector job retries up to the maximum number of times (5) and eventually completes with a status of "error."
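
One way to avoid starting a job with credentials that are about to lapse is to compare the `Expiration` timestamp returned by STS with the expected runtime. The Python sketch below assumes an ISO 8601 timestamp with a UTC offset or a trailing `Z`; the function name and the runtime estimates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expires_before_finish(expiration: str, now: datetime,
                          expected_runtime: timedelta) -> bool:
    """Return True if the credentials expire before the job is expected to end."""
    # Accept a trailing "Z" as well as an explicit offset such as +00:00.
    expiry = datetime.fromisoformat(expiration.replace("Z", "+00:00"))
    return expiry < now + expected_runtime


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expiration = "2024-01-01T12:15:00Z"  # 900 seconds after `now`

print(expires_before_finish(expiration, now, timedelta(hours=2)))     # True
print(expires_before_finish(expiration, now, timedelta(minutes=10)))  # False
```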

To confirm the import, see the steps in [Validating Your Data Connector Jobs](/int/amazon-s3-parquet-import-integration#h1__1293994107).

## Related Topics

- See [Scheduling Using TD Toolbelt](/int/scheduling-a-data-connector-job-execution-from-the-cli) for periodic execution of this integration
- See [Using TD Workflow with Integrations](/int/using-td-workflow-with-td-integrations) to trigger your created Source from a workflow


### What can I do if the data connector for the S3 job is running for a long time?

Check the count of S3 files that your connector job is ingesting. If there are over 10,000 files, the performance degrades. To mitigate this issue, you can:

- Narrow the path_prefix option to reduce the count of S3 files.
- Set the min_task_size option to 268,435,456 (256 MB).


### Best practices to follow

- Provide files with many row_groups, not one row_group per file.
- If there are many files in one place, split them into multiple jobs using the Path Prefix and Path Regex parameters. Each job should have 20-40 files. For example:


```
path_prefix: folder/sub_folder
path_match_pattern: folder/sub_folder/regex
```
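
When deciding how to split a large directory into jobs, it can help to count files per sub-folder first. The following Python sketch groups object keys by their leading folders; the function name, the `depth` parameter, and the sample keys are hypothetical, and listing the keys from S3 is left out.

```python
from collections import Counter
from typing import Dict, Iterable


def files_per_prefix(keys: Iterable[str], depth: int = 2) -> Dict[str, int]:
    """Count files under each folder prefix, using up to `depth` folder levels."""
    counts = Counter()
    for key in keys:
        folders = key.split("/")[:-1]  # drop the file name
        prefix = "/".join(folders[:depth]) + "/" if folders else ""
        counts[prefix] += 1
    return dict(counts)


sample_keys = [
    "folder/sub_folder_a/part-0001.parquet",
    "folder/sub_folder_a/part-0002.parquet",
    "folder/sub_folder_b/part-0001.parquet",
]
print(files_per_prefix(sample_keys))
# {'folder/sub_folder_a/': 2, 'folder/sub_folder_b/': 1}
```

Prefixes whose counts are well above the suggested range are candidates for a narrower Path Prefix or an additional Path Regex.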