Microsoft Azure Data Lake Storage Import Integration

Microsoft Azure Data Lake Storage is an industry-leading storage solution for big data. This import integration ingests parquet files from your Azure Data Lake Storage into Treasure Data so that you can consolidate them with the other sources set up in Treasure Data.

What can you do with this Integration?

  • Copy all existing data files: You can copy all of your Azure Data Lake parquet files into Treasure Data in order to migrate from the system.
  • Ingest data directly: Import data directly from the Azure Data Lake into Treasure Data instead of having to use a bridge system.

Prerequisites

  • Basic knowledge of Treasure Data
  • Basic knowledge of Microsoft Azure Data Lake
  • Microsoft Azure Data Lake account with sufficient permissions to create Shared Access Signatures and download files

Limitations

  • Supports only Azure Data Lake Storage (v2)
  • Supports only one level of partition keys as a Spark Partition (https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#partition-discovery)
  • Supports only snappy compression codec
  • No data preview is available due to the time required to process the download and read the parquet file schema
  • Supports only the HTTP Proxy method
  • Does not support BlobStorageEvents or SoftDelete for Azure
  • Does not support the delta file format. (For other common file formats such as CSV and TSV, use Microsoft Azure Blob Storage.)
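
The single-level Spark partition layout listed above places files in sub folders named `<column_name>=value`. The following is a minimal Python sketch of how such a path could be split into a partition column and value. The function name `parse_partition` and the sample paths are hypothetical and only illustrate the folder convention; this is not the connector's own implementation.

```python
from typing import Optional, Tuple


def parse_partition(path_prefix: str, file_path: str) -> Optional[Tuple[str, str]]:
    """Return (column, value) for a file exactly one partition folder below path_prefix.

    Returns None when the layout is not a single <column_name>=value folder.
    """
    prefix = path_prefix.strip("/")
    relative = file_path.strip("/")
    if prefix:
        if not relative.startswith(prefix + "/"):
            return None
        relative = relative[len(prefix) + 1:]
    parts = relative.split("/")
    # Expect exactly [partition_folder, file_name]: one partition level only.
    if len(parts) != 2 or "=" not in parts[0]:
        return None
    column, _, value = parts[0].partition("=")
    if not column:
        return None
    return column, value
```

For example, `parse_partition("/data/records/", "/data/records/year=2023/part-0000.parquet")` returns `("year", "2023")`, while a path with two nested partition folders returns `None`.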

Parquet file size expectation

  • Treasure Data recommends that you limit Row Group Size to less than 3.4GB. If the Row Group size is larger than 3.4GB, your import job may encounter an "Out of Memory" error.

    If this happens, re-partition your data into smaller parquet files and then retry the import job.
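
To check row group sizes before starting an import, something like the following Python sketch may help. It assumes the third-party `pyarrow` package is installed and that the file is available locally. It also assumes the 3.4GB guidance means 3,400,000,000 bytes of uncompressed row group data (pyarrow's `total_byte_size`); the source does not say which measure the limit refers to. The function names are hypothetical.

```python
from typing import List

# Assumption: the 3.4GB guidance is treated as 3.4 * 10**9 bytes.
ROW_GROUP_LIMIT_BYTES = 3_400_000_000


def oversized_row_groups(sizes: List[int], limit: int = ROW_GROUP_LIMIT_BYTES) -> List[int]:
    """Return the indexes of row groups whose byte size is at or above the limit."""
    return [i for i, size in enumerate(sizes) if size >= limit]


def row_group_sizes(parquet_path: str) -> List[int]:
    """Read per-row-group byte sizes from a local parquet file's footer metadata."""
    # Third-party dependency, imported lazily so the helper above works without it.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(parquet_path).metadata
    return [metadata.row_group(i).total_byte_size for i in range(metadata.num_row_groups)]
```

A call such as `oversized_row_groups(row_group_sizes("part-0000.parquet"))` would list the row groups that are candidates for re-partitioning; the file name here is only a placeholder.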

Obtain the Access Key or Shared Access Signature from Microsoft Azure Portal

  1. Navigate to Storage Account management on your Azure Portal
  2. Select Access keys or Shared access signatures
  3. Copy the keys to use in the TD authentication configuration

Use the TD Console to Create Your Connection

Create a New Connection

In Treasure Data, you must create and configure the data connection prior to running your query. As part of the data connection, you provide authentication to access the integration.

  1. Open the TD Console.
  2. Navigate to the Integrations Hub > Catalog.
  3. Click the search icon on the far-right of the Catalog screen, and enter Azure Lake.
  4. Hover over the Microsoft Azure Data Lake connector and select Create Authentication.
  5. Choose one of the following authentication methods:

Account Key authentication method

  1. Select Account Key from the Authentication Mode dropdown menu.
  2. Enter your storage Account Name.
  3. Enter your Account Key copied from the Azure Portal.

Shared Access Signatures authentication method

  1. Select Shared Access Signature from the Authentication Mode dropdown menu.
  2. Enter your storage Account Name.
  3. Enter your Shared Access Signature token copied from the Azure Portal.

Proxy Setting (Optional)

  1. If you want to run through your HTTP Proxy, select your Proxy Type.
  2. Enter the Proxy Host, Proxy Port, Proxy Username, and Proxy Password.

On Premises Setting (Optional)

  1. If your Data Lake is on premises, specify your On Premises Setting.
  2. Enter the Premises Host. For On Premises Setting, only the Shared Access Signatures authentication method is supported.

  6. Enter a name for your connection.
  7. Choose whether to share the authentication with others.
  8. Select Continue.

Transfer Your Data to Treasure Data

After creating the authenticated connection, you are automatically taken to Authentications.

  1. Search for the connection you created.
  2. Select New Source.
  3. Type a name for your Source in the Data Transfer field.
  4. Select Next. The Source Table dialog opens.

  5. Edit the following parameters:

    • Container: The container name of the Data Lake.
    • Path Prefix: The path to the folder that contains all the files to ingest.
    • Path Match Pattern (optional): Only import the files whose paths match this regular expression.
    • Sub folders are partitions: Enable to specify that you use the Spark Partition folder structure. The sub folder must have a name in the format <column_name>=value.
    • Enable Schema Evolution: Enable schema evolution for parquet.
    • Incremental Loading: Enable incremental mode.
    • End time: Only files that have been modified after the time specified will be imported.
    • Schema Settings: If "Sub Folders Are Partitions" is enabled, you must provide the column name and data type of that partition column.
    • Include Columns: Limits the columns that will be imported. Only columns specified in the list will be imported. If the list is empty, all columns will be imported.

  6. Select Next. The Data Settings page can be modified for your needs or you can skip the page.

  7. Optionally, edit the following parameters:

    • Retry Limit: Maximum number of retries.
    • Initial retry interval in millis: The initial retry interval in milliseconds.
    • Max retry wait in millis: The maximum retry interval in milliseconds. After the initial retry, the wait interval is doubled until this maximum is reached.
    • repartition_number: Splits the input file into smaller files to avoid an Out Of Memory exception. This applies to large data files. The default value is 100.

  8. Select Next.
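
The retry settings above describe a doubling wait that is capped at a maximum. As a rough illustration of that schedule, here is a minimal Python sketch. The function name `retry_waits` and the sample values are hypothetical; the connector's actual retry logic may differ in detail (for example, it may add jitter).

```python
from typing import List


def retry_waits(retry_limit: int, initial_interval_ms: int, max_wait_ms: int) -> List[int]:
    """Return the wait in milliseconds before each retry, doubling up to max_wait_ms."""
    waits = []
    wait = min(initial_interval_ms, max_wait_ms)
    for _ in range(retry_limit):
        waits.append(wait)
        # Double the interval after each retry, never exceeding the maximum.
        wait = min(wait * 2, max_wait_ms)
    return waits
```

With a retry limit of 5, an initial interval of 1000 ms, and a maximum of 6000 ms, `retry_waits(5, 1000, 6000)` returns `[1000, 2000, 4000, 6000, 6000]`.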

Data Preview

Data Preview is not supported with this integration. The preview only displays example data.

Data Placement

For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

  1. Select Next. Under Storage, create a new database or select an existing one, and create a new table or select an existing one, to hold the imported data.

  2. Select a Database > Select an existing or Create New Database.

  3. Optionally, type a database name.

  4. Select a Table > Select an existing or Create New Table.

  5. Optionally, type a table name.

  6. Choose the method for importing the data.

    • Append (default): Data import results are appended to the table. If the table does not exist, it will be created.
    • Always Replace: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
    • Replace on New Data: Replaces the entire content of an existing table with the result output only when there is new data.
  7. Select the Timestamp-based Partition Key column. If you want to set a different partition key seed than the default key, you can specify the long or timestamp column as the partitioning time. As a default time column, it uses upload_time with the add_time filter.

  8. Select the Timezone for your data storage.

  9. Under Schedule, you can choose when and how often you want to run this query.

Run once

  1. Select Off.
  2. Select Scheduling Timezone.
  3. Select Create & Run Now.

Repeat Regularly

  1. Select On.
  2. Select the Schedule. The UI provides these four options: @hourly, @daily, @monthly, or custom cron.
  3. You can also select Delay Transfer and add a delay of execution time.
  4. Select Scheduling Timezone.
  5. Select Create & Run Now.

After your transfer has run, you can see the results of your transfer in Data Workbench > Databases.

Optionally Configure Workflow

Within Treasure Workflow, you can specify the use of this data connector within a workflow.

Learn more at Using Workflows to Export Data with the TD Toolbelt.

Example Workflow for Azure Data Lake Storage Input

+setup:
  echo>: start ${session_time}

+import-with-sql:
  td_load>: config.yml
  database: ${td.some_database}
  table: ${td.some_table2}

+teardown:
  echo>: finish ${session_time}
  

Example (config.yml)

The following is an example configuration file to fetch files from Azure Data Lake:

input:
  type: azure_datalake
  authentication_mode: account_key
  account_name: tdadl
  account_key: fjZliu61iZV
  sas_token: ?sv=sas_token
  container_name: test
  path_prefix: /traffic_data/partition/collisionrecords2/
  path_match_pattern: /traffic_data/partition/collisionrecords2/*
  subfolder_partitions: true
  proxy_type: none
  proxy_host: host
  proxy_port: 3128
  proxy_username: tdpy
  proxy_password: 321tre
  repartition_number: 100
  schema_evolution: false
  incremental: true
  last_updated_at: "2023-12-20T03:51:20.937Z"
  include_columns: [col0, col3, col4]
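
To reason about which files a configuration like this would pick up, the following Python sketch combines the `path_match_pattern` and `last_updated_at` settings into one filter. It is an illustration under stated assumptions, not the connector's implementation: it assumes the pattern is applied as a regular expression search over the full path and that only files modified strictly after `last_updated_at` are selected. The helper names and sample paths are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


def parse_utc(timestamp: str) -> datetime:
    """Parse a timestamp such as 2023-12-20T03:51:20.937Z into an aware UTC datetime."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def select_files(
    files: Iterable[Tuple[str, datetime]],
    path_match_pattern: Optional[str],
    last_updated_at: Optional[datetime],
) -> List[str]:
    """Keep paths that match the pattern and were modified after last_updated_at."""
    regex = re.compile(path_match_pattern) if path_match_pattern else None
    selected = []
    for path, modified in files:
        # Assumption: the pattern is searched anywhere in the path.
        if regex is not None and regex.search(path) is None:
            continue
        # Assumption: files modified at or before the cutoff are skipped.
        if last_updated_at is not None and modified <= last_updated_at:
            continue
        selected.append(path)
    return selected
```

Passing `None` for both the pattern and the cutoff keeps every file, which mirrors a non-incremental run with no path filter.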

Enabling Soft Delete for Blobs

When the "Soft delete for blobs" feature is enabled, the connector does not work, because it uses the REST API.

To use this feature, log in to the Azure Portal and, from the Overview page, change the Properties of the Data Lake Storage account so that "Hierarchical namespace" is Enabled.

An example of the error looks like this:

Caused by: org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException: Operation failed: "Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.", 403, HEAD, https://kfcusprdanalyticsadl.dfs.core.windows.net/data-science-container/?upn=false&action=getAccessControl&timeout=90&sp=racwdlmep&st=2024-08-26T15:03:55Z&se=2024-10-15T23:03:55Z&spr=https&sv=2022-11-02&sr=d&sig=XXXXX&sdd=2s, rId: 596f51a1-601f-000d-6fe0-f7dd2e000000
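
When triaging a 403 like the one above, it can help to first rule out an expired or not-yet-valid SAS token before looking at storage account settings. The Python sketch below reads the standard SAS query parameters (`sp`, `st`, `se`, `sr`, `sv`) from a request URL and checks the validity window. It assumes both `st` and `se` are present and use the second-precision UTC format seen in the example error; the function names and the sample URL in the usage note are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict
from urllib.parse import parse_qs, urlparse

SAS_FIELDS = ("sp", "st", "se", "sr", "sv")


def sas_fields(url: str) -> Dict[str, str]:
    """Extract selected SAS query parameters from a request URL."""
    query = parse_qs(urlparse(url).query)
    return {name: query[name][0] for name in SAS_FIELDS if name in query}


def sas_is_active(fields: Dict[str, str], now: datetime) -> bool:
    """Return True when now falls inside the token's st..se validity window."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(fields["st"], fmt).replace(tzinfo=timezone.utc)
    expiry = datetime.strptime(fields["se"], fmt).replace(tzinfo=timezone.utc)
    return start <= now <= expiry
```

For instance, `sas_is_active(sas_fields(url), datetime.now(timezone.utc))` reports whether the token in `url` is currently inside its window. If it is, the cause is more likely a configuration issue such as the one described in this section.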