This feature is in BETA. Contact your Customer Success Representative for more information.


Microsoft Azure Data Lake Storage is an industry-leading storage solution for big data. Our import integration ingests Parquet files from your Azure Data Lake Storage into Treasure Data for consolidation with the other sources you have set up in Treasure Data.

What can you do with this Integration?

  • Copy all existing data files: Copy all of your Azure Data Lake Parquet files into Treasure Data, for example when migrating off the system.
  • Ingest data directly: Import data directly from Azure Data Lake into Treasure Data without needing a bridge system.



Prerequisites

  • Basic knowledge of Treasure Data

  • Basic knowledge of Microsoft Azure Data Lake

  • Microsoft Azure Data Lake account with sufficient permissions to create Shared Access Signatures and download files

Limitations

  • Only supports Azure Data Lake Storage Gen2
  • Supports only one level of partition keys as Spark partitions (https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#partition-discovery)
  • Only supports the snappy compression codec
  • No data preview available, because of the time required to download files and read the Parquet file schema
  • No incremental loading support (due to how Spark names and updates Parquet files)
  • Only supports the HTTP proxy method
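
For example, with one level of partitioning and snappy-compressed Parquet files, a readable folder layout looks roughly like the following (the paths, the date partition column, and the file names are illustrative):

/traffic_data/partition/collisionrecords2/date=2023-01-01/part-00000.snappy.parquet
/traffic_data/partition/collisionrecords2/date=2023-01-02/part-00001.snappy.parquet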

Obtain the Access Key or Shared Access Signature from Microsoft Azure Portal


1. Navigate to Storage Account management in your Azure Portal.


2. Select Access keys or Shared access signatures.
3. Copy the key or SAS token to use in the TD authentication configuration.


Use the TD Console to Create Your Connection

Create a New Connection

In Treasure Data, you must create and configure the data connection prior to running your query. As part of the data connection, you provide authentication to access the integration.

1. Open TD Console.
2. Navigate to Integrations Hub > Catalog.
3. Search for and select Azure Data Lake.


4. Select Create Authentication.
5. Choose one of the following authentication methods:
  Account Key:
    1. Select Account Key from the Authentication Mode dropdown menu.
    2. Enter your storage Account Name.
    3. Enter the Account Key copied from the Azure Portal.
  Shared Access Signature:
    1. Select Shared Access Signature from the Authentication Mode dropdown menu.
    2. Enter your storage Account Name.
    3. Enter the Shared Access Signature (SAS) token copied from the Azure Portal.
  Optional settings:
    1. To route the transfer through your HTTP proxy, select your Proxy Type, then enter the Proxy Host, Proxy Port, Proxy Username, and Proxy Password.
    2. If your Data Lake is on premises, specify your On Premises Setting and enter the Premises Host.
      For the On Premises Setting, only the Shared Access Signature authentication method is supported.
6. Enter a name for your connection.
7. Choose whether to share the authentication with others.
8. Select Continue.
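
The same choices map to fields in the connector configuration when you drive the import from a workflow (compare the config.yml example later on this page). The following is a minimal sketch with placeholder values; the authentication_mode value for Shared Access Signature mode is not shown in that example, so only the SAS token field is hinted at here:

# Account Key mode (placeholder values)
authentication_mode: account_key
account_name: your_storage_account
account_key: your_account_key

# Shared Access Signature mode supplies a SAS token instead of an account key
# sas_token: ?sv=...

# Optional HTTP proxy settings (only the HTTP proxy method is supported)
proxy_type: none
# proxy_host: proxy.example.com
# proxy_port: 3128
# proxy_username: your_proxy_user
# proxy_password: your_proxy_password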



Transfer Your Data to Treasure Data

After creating the authenticated connection, you are automatically taken to Authentications.


1. Search for the connection you created. 
2. Select New Source.
3. Type a name for your Source in the Data Transfer field.
4. Select Next.

The Source Table dialog opens.

5. Edit the following parameters:
  • Container: The container name in your Data Lake.
  • Path Prefix: The path to the folder containing the files you want to ingest.
  • Path Match Pattern (optional): Import only the files whose paths match this regular expression pattern.
  • Sub folders are partitions: Enable this option if you use the Spark partition folder structure. Each sub folder must be named in the format <column_name>=<value>.
  • Schema Settings: If Sub folders are partitions is enabled, you must provide the column name and data type of the partition column.
  These settings correspond to fields in the connector configuration; see the sketch that follows these steps.
6. Select Next.

You can modify the Data Settings page for your needs or skip it.

7. Optionally, edit the following parameters:
  • Retry Limit: The maximum number of retries for a failed request.
  • Initial retry interval in millis: The wait time, in milliseconds, before the first retry.
  • Max retry wait in millis: The maximum wait time, in milliseconds, between retries.
  • repartition_number: Splits the input file into smaller files to avoid an Out Of Memory exception. This applies to large data files. The default value is 100.
8. Select Next.
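
The source and data settings above correspond to fields in the connector configuration (compare the config.yml example later on this page). The following is a minimal sketch reusing the values from that example; the field name behind Schema Settings is not shown there, so it is omitted:

container_name: test                                               # Container
path_prefix: /traffic_data/partition/collisionrecords2/            # Path Prefix
path_match_pattern: /traffic_data/partition/collisionrecords2/*    # Path Match Pattern (optional)
subfolder_partitions: true                                         # Sub folders are partitions
# With subfolder_partitions enabled, each sub folder must be named <column_name>=<value>,
# for example collisionrecords2/date=2023-01-01/ (hypothetical partition column).
repartition_number: 100    # split large input files to avoid an Out Of Memory exception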


Data Preview 

Data Preview is not supported with this integration; the preview displays only example data.

Data Placement


For data placement, select the target database and table where you want your data placed and indicate how often the import should run.

  1. Select Next. Under Storage, select an existing database and table, or create new ones, for the imported data.

  2. Select a Database > Select an existing or Create New Database.

  3. Optionally, type a database name.

  4. Select a Table > Select an existing or Create New Table.

  5. Optionally, type a table name.

  6. Choose the method for importing the data.

    • Append (default): Data import results are appended to the table.
      If the table does not exist, it is created.

    • Always Replace: Replaces the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.

    • Replace on New Data: Replaces the entire content of an existing table with the result output, but only when there is new data.

  7. Select the Timestamp-based Partition Key column.
    If you want to set a partition key other than the default, you can specify a long or timestamp column as the partitioning time. By default, the time column is upload_time, added by the add_time filter.

  8. Select the Timezone for your data storage.

  9. Under Schedule, you can choose when and how often you want to run this query.

    • Run once:
      1. Select Off.

      2. Select Scheduling Timezone.

      3. Select Create & Run Now.

    • Repeat the query:

      1. Select On.

      2. Select the Schedule. The UI provides four options: @hourly, @daily, @monthly, or custom cron (see the example after these steps).

      3. You can also select Delay Transfer to delay the execution time.

      4. Select Scheduling Timezone.

      5. Select Create & Run Now.
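
      For example, a custom cron schedule of 30 2 * * * runs the transfer every day at 02:30 in the selected Scheduling Timezone.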

After your transfer has run, you can see the results in Data Workbench > Databases.


Optionally Configure Workflow

You can specify the use of this data connector within a Treasure Workflow.

Learn more at Using Workflows to Export Data with the TD Toolbelt.

Example Workflow for Azure Data Lake Storage Input


_export:
  td:
    some_database: your_database   # destination database name (placeholder)
    some_table2: your_table        # destination table name (placeholder)

+setup:
  echo>: start ${session_time}

+import:
  td_load>: config.yml
  database: ${td.some_database}
  table: ${td.some_table2}

+teardown:
  echo>: finish ${session_time}

Example (config.yml)

The following is an example configuration file to fetch files from Azure Data Lake:

input:  
  type: azure_datalake
  authentication_mode: account_key
  account_name: tdadl
  account_key: fjZliu61iZV
  sas_token: ?sv=sas_token
  container_name: test
  path_prefix:  /traffic_data/partition/collisionrecords2/
  path_match_pattern: /traffic_data/partition/collisionrecords2/*
  subfolder_partitions: true
  proxy_type: none
  proxy_host: host
  proxy_port: 3128
  proxy_username: tdpy
  proxy_password: 321tre
  repartition_number: 100
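
If you manage the workflow with the TD Toolbelt, a typical sequence is to save the workflow file and config.yml in a project directory, push the project, and start a session. The project and workflow names below are hypothetical:

td workflow push azure_datalake_import
td workflow start azure_datalake_import azure_datalake_import --session now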
