# Microsoft Azure Data Lake Storage Import Integration

Microsoft Azure Data Lake Storage is an industry-leading storage solution for big data. Our import integration ingests Parquet files from your Azure Data Lake Storage into Treasure Data for consolidation with the other sources set up in Treasure Data.

## What can you do with this Integration?

- **Copy all existing data files**: Copy all of your Azure Data Lake Parquet files into Treasure Data in order to migrate from the system.
- **Ingest data directly**: Import data directly from Azure Data Lake into Treasure Data instead of having to use a bridge system.

## Prerequisites

- Basic knowledge of Treasure Data
- Basic knowledge of Microsoft Azure Data Lake
- A Microsoft Azure Data Lake account with sufficient permissions to create Shared Access Signatures and download files

## Limitations

- Supports only Azure Data Lake Storage (v2)
- Supports only one level of partition keys as a Spark partition ([https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#partition-discovery](https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#partition-discovery))
- Supports only the snappy compression codec
- No data preview is available, due to the time required to download the files and read the Parquet file schema
- Supports only the HTTP Proxy method
- Does not support BlobStorageEvents or SoftDelete for Azure
- The delta file format is not supported. (For other common file formats such as CSV and TSV, use Microsoft Azure Blob Storage.)

**Parquet file size expectation**: Treasure Data recommends that you limit the Row Group size to less than 3.4 GB. If the Row Group size is larger than 3.4 GB, your import job may encounter an "Out of Memory" error. If this happens, re-partition your data into smaller Parquet files and then retry the import job.

## Obtain the Access Key or Shared Access Signature from Microsoft Azure Portal

1. Navigate to Storage Account management in your Azure Portal.
   ![](/assets/image2021-9-10_13-50-26.8681dfcfe52df8be55de9a9057745c441f0386bc3bbe7d44fd542a25e117a012.be2392e9.png)
2. Select Access keys or Shared access signatures.
3. Copy the keys to use in the TD authentication configuration.

## Use the TD Console to Create Your Connection

### Create a New Connection

In Treasure Data, you must create and configure the data connection before running your query. As part of the data connection, you provide authentication to access the integration.

1. Open the TD Console.
2. Navigate to the Integrations Hub > Catalog.
3. Click the search icon on the far right of the Catalog screen and enter Azure Lake.
4. Hover over the Microsoft Azure Data Lake connector and select Create Authentication.
   ![](/assets/msazurelake.aaed38ee4d153a8565b6ddef84296c06764d2bbc14f918d9dacf1a27da62c307.be2392e9.png)
5. Choose one of the following authentication methods and complete its settings:

   **Account Key authentication method**
   1. Select **Account Key** from the Authentication Mode dropdown menu.
   2. Enter your storage **Account Name**.
   3. Enter your **Account Key** copied from the Azure Portal.

   **Shared Access Signatures authentication method**
   1. Select **Shared Access Signature** from the Authentication Mode dropdown menu.
      ![](https://docs.treasuredata.com/download/attachments/17407455/Screen%20Shot%202021-08-30%20at%2011.29.09.png?version=1&modificationDate=1630297628500&api=v2)
   2. Enter your storage **Account Name**.
   3. Enter your **Shared Access Signature** copied from the Azure Portal.

   **Proxy Setting (Optional)**
   1. If you want to run through your HTTP proxy, select your **Proxy Type**.
      ![](https://docs.treasuredata.com/download/attachments/17407455/image2021-9-1_16-15-46.png?version=1&modificationDate=1630487604199&api=v2)
   2. Enter **Proxy Host**, **Proxy Port**, **Proxy Username**, and **Proxy Password**.

   **On Premises Setting (Optional)**
   1. If your Data Lake is on premises, specify your On Premises Setting.
      ![](https://docs.treasuredata.com/download/attachments/17407455/image2021-9-1_16-13-19.png?version=1&modificationDate=1630487457048&api=v2)
   2. Enter the **Premises Host**. For the On Premises Setting, only the Shared Access Signatures authentication method is supported.

6. Enter a name for your connection.
7. Choose whether to share the authentication with others.
8. Select **Continue**.
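For reference, the authentication and proxy fields you enter in the console correspond to keys in the `td_load` configuration file shown later on this page. The following is a minimal sketch, assuming Account Key authentication behind an HTTP proxy; every value is a placeholder, not a working credential.

```yaml
# Sketch only: key names are taken from the example config.yml later on this page;
# all values here are placeholders.
input:
  type: azure_datalake
  authentication_mode: account_key   # the Shared Access Signature mode supplies sas_token instead
  account_name: mystorageaccount     # storage Account Name
  account_key: XXXXXXXX              # Account Key copied from the Azure Portal
  container_name: mycontainer
  proxy_type: none                   # as in the example later on this page; adjust per your Proxy Setting
  proxy_host: proxy.example.com
  proxy_port: 3128
  proxy_username: tduser
  proxy_password: XXXXXXXX
```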
### Transfer Your Data to Treasure Data

After creating the authenticated connection, you are automatically taken to Authentications.

1. Search for the connection you created.
2. Select **New Source**.
3. Type a name for your **Source** in the Data Transfer field.
4. Select **Next**. The Source Table dialog opens.
   ![](/assets/microsoft-azure-data-lake-storage-import-integration-2024-02-09.ee09ed7365f2ae06e3af338e48bcdd7edd953d3dc4da9b3347eafb96b2fc71db.be2392e9.png)
5. Edit the following parameters:

   | Parameters | Description |
   | --- | --- |
   | Container | The container name of the Data Lake |
   | Path Prefix | The path to the folder containing all the files to ingest |
   | Path Match Pattern (optional) | Import only the files whose paths match this regular expression pattern |
   | Sub folders are partitions | Enable to specify that you use the Spark partition folder structure. Each subfolder must have a name in the format `column=value` (see the example after these steps). |
   | Enable Schema Evolution | Enable schema evolution for Parquet |
   | Incremental Loading | Enable incremental mode |
   | End time | Only files that have been modified after the specified time are imported |
   | Schema Settings | If Sub folders are partitions is enabled, you must provide the column name and data type of that partition column |
   | Include Columns | Limits the columns that are imported. Only the columns specified in the list are imported; if the list is empty, all columns are imported. |

6. Select **Next**. The Data Settings page can be modified for your needs, or you can skip the page.
7. Optionally, edit the following parameters:

   | Parameter | Description |
   | --- | --- |
   | Retry Limit | Maximum number of retries |
   | Initial retry interval in millis | The initial retry interval in milliseconds |
   | Max retry wait in millis | The maximum retry interval. After the initial retry, the wait interval is doubled until this maximum is reached. |
   | repartition_number | Split the input file into smaller files to avoid Out of Memory exceptions. This applies to large data files. The default value is 100. |

8. Select **Next**.
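To illustrate the Sub folders are partitions and Schema Settings options, here is a sketch of a one-level Spark partition layout together with the matching source settings, expressed with the config keys from the example config.yml later on this page. The `date` partition column and the subfolder names are hypothetical.

```yaml
# Hypothetical container layout (one level of partition keys, snappy-compressed Parquet):
#   /traffic_data/partition/collisionrecords2/date=2024-01-01/part-00000.snappy.parquet
#   /traffic_data/partition/collisionrecords2/date=2024-01-02/part-00000.snappy.parquet
#
# Matching source settings (key names from the example config.yml on this page):
input:
  type: azure_datalake
  container_name: test
  path_prefix: /traffic_data/partition/collisionrecords2/
  subfolder_partitions: true      # "Sub folders are partitions"
# In the console's Schema Settings, declare the partition column (here, "date") and its data type.
```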
### Data Preview

Data Preview is not supported for this integration; the preview only displays sample data.

### Data Placement

For data placement, select the target database and table where you want your data placed, and indicate how often the import should run.

1. Select **Next**. Under Storage, create a new database or select an existing one, and create a new table or select an existing one, for where you want to place the imported data.
2. Select a **Database** > **Select an existing** or **Create New Database**.
3. Optionally, type a database name.
4. Select a **Table** > **Select an existing** or **Create New Table**.
5. Optionally, type a table name.
6. Choose the method for importing the data:
   - **Append** (default): Data import results are appended to the table. If the table does not exist, it is created.
   - **Always Replace**: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
   - **Replace on New Data**: Replaces the entire content of an existing table with the result output only when there is new data.
7. Select the **Timestamp-based Partition Key** column. If you want to set a partition key seed other than the default, you can specify a long or timestamp column as the partitioning time. By default, the time column is upload_time with the add_time filter.
8. Select the **Timezone** for your data storage.
9. Under **Schedule**, choose when and how often you want to run this query.

#### Run once

1. Select **Off**.
2. Select **Scheduling Timezone**.
3. Select **Create & Run Now**.

#### Repeat Regularly

1. Select **On**.
2. Select the **Schedule**. The UI provides four options: *@hourly*, *@daily*, and *@monthly*, or custom *cron*.
3. Optionally, select **Delay Transfer** and add a delay to the execution time.
4. Select **Scheduling Timezone**.
5. Select **Create & Run Now**.

After your transfer has run, you can see the results of your transfer in **Data Workbench** > **Databases**.

## Optionally Configure Workflow

Within Treasure Workflow, you can specify the use of this data connector within a workflow. Learn more at [Using Workflows to Export Data with the TD Toolbelt](https://api-docs.treasuredata.com/en/tools/cli/api/#workflow-commands).

### Example Workflow for Azure Data Lake Storage Input

```yaml
+setup:
  echo>: start ${session_time}

+import-with-sql:
  td_load>: config.yml
  database: ${td.some_database}
  table: ${td.some_table2}

+teardown:
  echo>: finish ${session_time}
```

### Example (config.yml)

The following is an example configuration file that fetches files from Azure Data Lake:

```yaml
input:
  type: azure_datalake
  authentication_mode: account_key
  account_name: tdadl
  account_key: fjZliu61iZV
  sas_token: ?sv=sas_token
  container_name: test
  path_prefix: /traffic_data/partition/collisionrecords2/
  path_match_pattern: /traffic_data/partition/collisionrecords2/*
  subfolder_partitions: true
  proxy_type: none
  proxy_host: host
  proxy_port: 3128
  proxy_username: tdpy
  proxy_password: 321tre
  repartition_number: 100
  schema_evolution: false
  incremental: true
  last_updated_at: "2023-12-20T03:51:20.937Z"
  include_columns: [col0, col3, col4]
```
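If you want the workflow itself to run the import on a schedule (instead of, or in addition to, the console schedule described above), the following is a minimal sketch in digdag syntax; the time, timezone, and task name are examples, and it reuses the `config.yml`, database, and table variables from the workflow example above.

```yaml
timezone: UTC

schedule:
  daily>: 03:00:00

+import:
  td_load>: config.yml
  database: ${td.some_database}
  table: ${td.some_table2}
```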
## Enabling Soft Delete for Blobs

When you enable the "Soft delete for blobs" feature, the connector doesn't work because it uses the REST API. To use this feature, log in to the Azure Portal and, from the Overview page, change the Properties of the Data Lake Storage account so that "Hierarchical namespace" is Enabled.

An example of the resulting error looks like this:

```
Caused by: org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException: Operation failed: "Server failed to authenticate the request.
Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD, https://kfcusprdanalyticsadl.dfs.core.windows.net/data-science-container/?upn=false&action=getAccessControl&timeout=90&sp=racwdlmep&st=2024-08-26T15:03:55Z&se=2024-10-15T23:03:55Z&spr=https&sv=2022-11-02&sr=d&sig=XXXXX&sdd=2s, rId: 596f51a1-601f-000d-6fe0-f7dd2e000000
```

![](/assets/microsoft-azure-data-lake-storage-import-integration-2024-01-22-1.b750ccc25eaedbfedd0afde60be2d202068ada00dd83d08a52c7624981a7b4d2.be2392e9.png)