With its columnar data format, Parquet is particularly well suited to analytics workloads, where queries often need to access only a subset of the columns in a dataset, making it a preferred option in many data pipeline strategies.
This TD export integration lets you generate Parquet files from Treasure Data job results and upload them directly to Azure Data Lake Storage.
- Basic knowledge of Treasure Data.
- Basic knowledge of Azure Data Lake Storage.
- Azure Data Lake Storage Account configured with Hierarchical Namespace enabled.
- Azure Data Lake Access Keys or a Shared Access Signature set up. For more details, refer to Obtain Azure Data Lake Credentials.
- The export integration supports only account-level or container-level Shared Access Signatures. Folder-level and directory-level Shared Access Signatures are not supported.
In TD Console, you must create and configure the data connection before running your query. As part of the data connection, you provide authentication to access the integration by following the steps below.
- Open TD Console
- Navigate to Integrations Hub > Catalog
- Search for your integration and select Azure Data Lake
- Select Create Authentication, and enter credential information for the integration
The Authentication fields
| Parameter | Description |
|---|---|
| Authentication Mode | Choose one of the two supported authentication modes: Access Key or Shared Access Signatures. |
| Account Name | The Azure storage account name with the hierarchical namespace enabled. |
| Account Key | The account key obtained from the Azure platform, required when the Access Key authentication mode is selected. See Obtain Azure Data Lake Credentials for more details. |
| SAS Token | The SAS token obtained from the Azure platform, required when the Shared Access Signatures authentication mode is selected. See Obtain Azure Data Lake Credentials for more details. |
| Proxy Type | Select HTTP if the connection to Azure Data Lake Storage goes through an HTTP proxy. |
| Proxy Host | The proxy host when Proxy Type is HTTP. |
| Proxy Port | The proxy port when Proxy Type is HTTP. |
| Proxy Username | The proxy username when Proxy Type is HTTP. |
| Proxy Password | The proxy password when Proxy Type is HTTP. |
| On Premises | Ignore this parameter; it does not apply to the export integration. |
- Select Continue.
- Enter a name for the Authentication, and select Done.
You can also send segment data to the target platform by creating an activation in the Audience Studio.
- Navigate to Audience Studio.
- Select a parent segment.
- Open the target segment, right-click, and then select Create Activation.
- In the Details panel, enter an Activation name and configure the activation according to the previous section on Configuration Parameters.
- Customize the activation output in the Output Mapping panel.

- Attribute Columns
- Select Export All Columns to export all columns without making any changes.
- Select + Add Columns to add specific columns for the export. The Output Column Name pre-populates with the same Source column name. You can update the Output Column Name. Continue to select + Add Columns to add new columns for your activation output.
- String Builder
- Select + Add string to create strings for export. Choose from the following values:
- String: Choose any value; use text to create a custom value.
- Timestamp: The date and time of the export.
- Segment Id: The segment ID number.
- Segment Name: The segment name.
- Audience Id: The parent segment ID number.
- Set a Schedule.

- Select the values to define your schedule and optionally include email notifications.
- Select Create.
TD Console supports multiple ways to export data. Perform the following steps to export data from the Data Workbench.
- Navigate to Data Workbench > Queries.
- Select New Query, and define your query.
- Select Export Results to configure the data exporting.

- Select an existing authentication, or create a new one following the above section.
- Configure the exporting parameters, and select Done.
- Optionally, configure a schedule to export data periodically to the specified target destination. For more details, see Scheduling Jobs on TD Console.
The Configuration fields
| Field | Description |
|---|---|
| Container name | The name of an existing Azure storage container. |
| File path | The export location in Azure Storage, including a file name with the .parquet extension, for example, "folder/file.parquet". |
| Overwrite existing file? | If the checkbox is checked, the existing file will be overwritten. |
| Parquet compression | The supported Parquet compression codecs: Uncompressed, Gzip, or Snappy. |
| Row group size (MB) | Parquet compression algorithms are only applied per row group, so the larger the row group size, the more opportunities to compress the data. Use this field to fine-tune the data exporting based on the actual data schema. Min: 10 MB, max: 1024 MB, default: 128 MB |
| Page size (KB) | The smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. Use this field to fine-tune the data exporting based on the actual data schema. Min: 8 KB, max: 2048 KB, default: 1024 KB. |
| Single file? | When enabled, a single file is created; otherwise, multiple files are generated based on the row group size. (Note that the output file size varies if compression is enabled.) |
| Advanced configuration | Enable this option to adjust the advanced configuration below. |
| Timestamp Unit | Configure timestamp unit in milliseconds or microseconds. Default: milliseconds. |
| Enable Bloom filter | A Bloom filter is a space-efficient structure that approximates set membership, offering a definitive "no" or a probable "yes". Enable this option to include a Bloom filter in the generated Parquet files. |
| Retry limit | The maximum number of retries. Min: 1, max: 10, default: 5. |
| Retry Wait (second) | The time to wait before retrying. Min: 1 second, max: 300 seconds, default: 3 seconds. |
| Number of concurrent threads | The number of concurrent requests to the Azure Storage service. Min: 1, max: 8, default: 4. |
| Part size (MB) | Azure Storage supports uploading files in small chunks; this sets the request buffer size. Min: 1 MB, max: 100 MB, default: 8 MB. |
You can use the TD CLI (Toolbelt) to export query results to Azure Data Lake in the Parquet file format. Specify the export configuration as the --result option of the td query command. For more details about the td query command, see this article.
The option is a JSON object with the following general structure.
```json
{
  "type": "azure_datalake",
  "authentication_mode": "account_key",
  "account_name": "demo_account_name",
  "proxy_type": "none",
  "container_name": "demo_container_name",
  "file_path": "joetest/test-3.parquet",
  "overwrite_file": "true",
  "compression": "uncompressed",
  "row_group_size": "128",
  "page_size": 1024,
  "single_file": "false",
  "advanced_configuration": "false"
}
```
Example usage:
```bash
td query --result '{"type":"azure_datalake","authentication_mode":"account_key","account_name": "account name", "account_key": "account key", "proxy_type": "none", "container_name": "container", "file_path": "joe/test.parquet", "overwrite_file": "false", "compression": "uncompressed", "row_group_size": 128, "page_size": 1024, "single_file": true, "advanced_configuration": false}' -d sample_datasets "select * from sample_table" -T presto
```
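The same --result JSON can also drive periodic exports from the CLI instead of TD Console. The following is a minimal sketch, assuming your installed Toolbelt version supports td sched:create with the --result and --type options shown (verify against your version); the account name, key, container, schedule, and file path are placeholders, and the example uses Snappy compression with single-file output to illustrate non-default settings from the configuration table above.

```bash
# Hypothetical sketch: schedule a daily export at 02:00 to Azure Data Lake Storage.
# The account name, key, container, and file path are placeholders; confirm the
# sched:create options against your installed Toolbelt version.
td sched:create daily_adl_export "0 2 * * *" \
  -d sample_datasets \
  -T presto \
  --result '{"type":"azure_datalake","authentication_mode":"account_key","account_name":"account name","account_key":"account key","proxy_type":"none","container_name":"container","file_path":"exports/daily.parquet","overwrite_file":true,"compression":"snappy","row_group_size":128,"page_size":1024,"single_file":true,"advanced_configuration":false}' \
  "select * from sample_table"
```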
Log in to the Microsoft Azure Portal and choose Storage accounts.

Select your storage account and make sure that the hierarchical namespace option is enabled.

To use the Access Key for the integration:
- Select Access keys from the Security + networking menu.
- Select the Show button and copy the access key.
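If you prefer the command line to the portal, the Azure CLI can retrieve the same keys. A minimal sketch follows; the resource group and account names are placeholders for your own values.

```bash
# Sketch: list the access keys of a storage account with the Azure CLI.
# my-resource-group and mystorageaccount are placeholder names.
az storage account keys list \
  --resource-group my-resource-group \
  --account-name mystorageaccount \
  --output table
```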

To use the Shared access signature for the integration:
- Select Shared access signature from the Security + networking menu.

- Configure the permissions and expiration date for the shared access signature.
- Select the Generate SAS and connection string button, and copy the SAS token.
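As an alternative to the portal, an account-level SAS token (one of the scopes this export integration supports) can be generated with the Azure CLI. A minimal sketch; the account name, key, permissions, and expiry are placeholder values to adjust for your environment.

```bash
# Sketch: generate an account-level SAS token for the Blob service.
#   --services b         Blob service only
#   --resource-types co  container- and object-level operations
#   --permissions rwl    read, write, list
az storage account generate-sas \
  --account-name mystorageaccount \
  --account-key "<account key>" \
  --services b \
  --resource-types co \
  --permissions rwl \
  --expiry 2025-12-31T00:00Z \
  --https-only
```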
