# Microsoft Azure Blob Storage Import Integration

[Learn more about Microsoft Azure Blob Storage Export Integration](/int/microsoft-azure-blob-storage-export-integration).

The Data Connector for Microsoft Azure Blob Storage enables you to import the contents of *.tsv* and *.csv* files stored in your Azure Blob Storage container.

# Limitations

**Connector UI limitations**. Editing with the Connector UI has many limitations. We suggest using the CLI for your edits.

The incremental flow will not work properly if Hierarchical Namespaces are enabled.

## Prerequisites

- Basic knowledge of Treasure Data
- A Microsoft Azure Platform account


## Configure the Connection

You can submit a DataConnector for Microsoft Azure Blob Storage from the [Connector UI](https://console.treasuredata.com/app/connections).

![](/assets/image-20191015-154021.a5cc6ed2d6c8814d389e469db0eee27e0144a91e90a2a1bb2f1573c020746a54.743883bd.png)

## Create a new Microsoft Azure Blob Storage connector

First, you must register the connector by setting the following parameters:

- **Storage Account name**: The name of your Microsoft Azure Blob Storage account.
- **Primary access key:** The access key used to access your Microsoft Azure Blob Storage account.


![](/assets/image-20191015-154039.ffdcfb7e9cd7929c00d3c2a69133a08267aff7678c0aa1977b9cacd943b16850.743883bd.png)

With the proxy setting enabled

![](/assets/image2021-9-20_13-54-47.a9d4f91ae596b2cc09993d4d6af92ac72d25a04bbb1af94e1f623422c181929a.743883bd.png)

## Transfer data from Microsoft Azure Storage

Next, create a “New Transfer” on the My Connections page. You can prepare an ad hoc DataConnector job or a scheduled DataConnector job. Complete the following steps.

![](/assets/image-20191015-154058.379b76779f527bd1c574756b4a413be14a62c4ed61182febca7b2ff6f9ad01f4.743883bd.png)

### Fetch from

Register the information that you want to ingest.

- **Container**: Azure cloud storage container name (Ex. *your_cont*)
- **Path Prefix**: prefix of target keys. (Ex. *logs/data_*)
- **Path Regex**: regular expression to match file paths. Files whose paths do not match this pattern are skipped. (Ex. *.csv$*)


##### Example: CloudFront

Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, CloudFront log files are saved as shown in the following example:


```
[your_container] - [logging] - [E231A697YXWD39.2017-04-23-15.a103fd5a.gz]
[your_container] - [logging] - [E231A697YXWD39.2017-04-23-15.b2aede4a.gz]
[your_container] - [logging] - [E231A697YXWD39.2017-04-23-16.594fa8e6.gz]
[your_container] - [logging] - [E231A697YXWD39.2017-04-23-16.d12f42f9.gz]
```

In this case, the “Fetch from” settings should be as follows:

- **Container**: your_container
- **Path Prefix**: logging/
- **Path Regex**: *.gz$* (Not Required)
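
As a rough illustration of how the two filters combine, the following Python sketch applies a path prefix and an optional path regex to a list of file names. The helper `select_paths` is hypothetical, the `readme.txt` and `other/` entries are made up for the example, and the use of `re.search` (a match anywhere in the path) is an assumption about the matching semantics rather than the connector's actual implementation.

```python
import re

def select_paths(paths, path_prefix, path_regex=None):
    """Keep paths that start with path_prefix and, if given, match path_regex."""
    # Assumption: the regex may match anywhere in the path (re.search semantics).
    pattern = re.compile(path_regex) if path_regex else None
    selected = []
    for path in paths:
        if not path.startswith(path_prefix):
            continue  # outside the prefix, so never considered
        if pattern is not None and not pattern.search(path):
            continue  # skipped because the path does not match the regex
        selected.append(path)
    return selected

# The first two names come from the example above; the last two are hypothetical.
blobs = [
    "logging/E231A697YXWD39.2017-04-23-15.a103fd5a.gz",
    "logging/E231A697YXWD39.2017-04-23-15.b2aede4a.gz",
    "logging/readme.txt",
    "other/E231A697YXWD39.2017-04-23-16.594fa8e6.gz",
]

matched = select_paths(blobs, "logging/", r".gz$")
print(matched)
```

With the prefix alone, `logging/readme.txt` would also be selected; adding the regex narrows the selection to the compressed log files.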


![](/assets/image-20191015-154114.8752d03745b3b93f6ef2fe555ee753273ac81b6cfccaaca9db12362215dd7591.743883bd.png)

### Preview

You can see a preview of the data you configured. If you are unable to see the preview or have any issues viewing the preview, contact [support](mailto:support@treasuredata.com).

![](/assets/image-20191209-211644.78d7eda5add6ed363b9050c68fcb1b7e6fcf1a3e0ed6f3f666e792ad54a83013.743883bd.png)

The preview command downloads one file from the specified container and displays the results from that file. As a result, the output of the preview command may differ from the result of the issue command.

If you want to set a specified column name, select **Advanced Settings**.

#### Advanced Settings

Advanced Settings allows you to edit guessed properties. Edit the following section, if you need to.

- **Default timezone**: Sets the time zone of timestamp columns when the value itself doesn’t include a time zone.
- **Columns**:
  - **Name**: Changes the name of the column. Supported characters for column names are lowercase letters, numbers, and “_” (underscore) only.
  - **Type**: Parses a value as a specified type and stores the type as part of the Treasure Data schema.
    - **boolean**
    - **long**
    - **timestamp**: imported as String type at Treasure Data (Ex. 2017-04-01 00:00:00.000)
    - **double**
    - **string**
    - **json**
- **Total file count limit**: maximum number of files to read. (optional)
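
If you want to check column names before entering them, the naming rule above can be expressed as a small Python sketch. Both helpers are hypothetical, the pattern assumes ASCII lowercase letters and digits, and `normalize_column_name` shows only one possible way to rename a column; it is not what the connector does.

```python
import re

# Assumption: a valid name is one or more lowercase ASCII letters, digits, or underscores.
VALID_COLUMN_NAME = re.compile(r"[a-z0-9_]+")

def is_valid_column_name(name):
    """Return True if the whole name uses only the supported characters."""
    return VALID_COLUMN_NAME.fullmatch(name) is not None

def normalize_column_name(name):
    """Lowercase the name and replace every unsupported character with an underscore."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

for candidate in ["created_at", "Created At", "order-id"]:
    print(candidate, is_valid_column_name(candidate), normalize_column_name(candidate))
```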


### Transfer to

In this phase, select your target database and table that you want to import to. You can create a new database or table using the `Create new database` or `Create new table` checkboxes.

- **Mode**: Append – Allows you to add records into the existing table.
- **Mode**: Replace – Replace the existing data in the table with the data being imported.
- **Partition key Seed**: Choose the long or timestamp column that you would like to use as the partitioning time column. If you do not specify a time column, the upload time of the transfer is used in conjunction with the addition of an add_time filter.


![](/assets/image-20191015-154201.3b19c127a567d44468ddb9d0ec5b68c35be196f47ea4a1c4c330d9e7eacb0f54.743883bd.png)

### When

In this phase, you can set an ad hoc or schedule configuration for your job.

- When
  - **Once now**: Run the transfer only once.
  - **Repeat…**
    - **Schedule**: accepts `@hourly`, `@daily`, `@monthly`, or a custom `cron` expression.
    - **Delay Transfer**: add a delay of execution time.
  - **Data Storage Timezone**: Timezone the data is stored in; data will also be displayed in this timezone. Supports extended timezone formats like ‘Asia/Tokyo’.


![](/assets/image-20191015-154222.2e3ccc0223d646f83485d513f7eb860b0e0d9c7338cb1df4fdd8cf50f85fb185.743883bd.png)

After selecting the frequency, select **Start Transfer** to begin the transfer. If there are no errors, the transfer into Treasure Data will complete and the data will be available. Jobs are kicked off when a transfer runs. You can use the Jobs or the My Input Transfers section to monitor the progress of your data transfer.

# Troubleshoot Data Import

Review the job log. Warnings and errors provide information about the success of your import. For example, you can [identify the source file names associated with import errors](https://docs.treasuredata.com/smart/project-product-documentation/data-import-error-troubleshooting).

# Use the CLI to Configure the Connector

You can also use the Microsoft Azure Blob Storage data connector from the command line interface. The following instructions show you how to import data using the CLI.

## Install ‘td’ command v0.11.9 or later

Install the newest [Treasure Data Toolbelt](https://toolbelt.treasuredata.com/).


```
$ td --version
0.11.10
```

## Create Seed Config File (seed.yml)

First, prepare *seed.yml* as shown in the following example, with your account information (Check [about Azure storage accounts](https://azure.microsoft.com/en-us/documentation/articles/storage-create-storage-account/)). You must also specify container name and target file name (or prefix for multiple files).


```
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
out:
  mode: append
```

The Data Connector for Microsoft Azure Blob Storage imports all files that match a specified prefix. (e.g. path_prefix: `path/to/sample_` –> `path/to/sample_201501.csv.gz`, `path/to/sample_201502.csv.gz`, …, `path/to/sample_201505.csv.gz`)

For more details on available *out* modes, review the Appendix below.

## Guess Fields (Generate load.yml)

Second, use *connector:guess*. This command automatically reads the source file, and assesses (uses logic to guess) the file format.


```
$ td connector:guess seed.yml -o load.yml
```

If you open up *load.yml*, you’ll see the guessed file format definitions including file formats, encodings, column names, and types.


```yaml
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - {name: id, type: long}
    - {name: company, type: string}
    - {name: customer, type: string}
    - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out:
  mode: append
```

Then, you can preview how the system will parse the file by using the *preview* command.


```
$ td connector:preview load.yml
+-------+---------+----------+---------------------+
| id    | company | customer | created_at          |
+-------+---------+----------+---------------------+
| 11200 | AA Inc. |    David | 2015-03-31 06:12:37 |
| 20313 | BB Imc. |      Tom | 2015-04-01 01:00:07 |
| 32132 | CC Inc. | Fernando | 2015-04-01 10:33:41 |
| 40133 | DD Inc. |    Cesar | 2015-04-02 05:12:32 |
| 93133 | EE Inc. |     Jake | 2015-04-02 14:11:13 |
+-------+---------+----------+---------------------+
```

|  |
|  --- |
| The guess command needs more than 3 rows and 2 columns in the source data file, because it guesses the column definitions from sample rows in the source data. |


If the system guesses a column name or column type incorrectly, modify *load.yml* directly and preview again.

Currently, the data connector supports parsing of “boolean”, “long”, “double”, “string”, and “timestamp” types.

You also must create a local database and table prior to executing the data load job. Follow these steps:


```
$ td database:create td_sample_db
$ td table:create td_sample_db td_sample_table
```

## Execute Load Job

Finally, submit the load job. It may take a couple of hours depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.

It’s also recommended to specify the *--time-column* option, because Treasure Data’s storage is partitioned by time (see [data partitioning](https://docs.treasuredata.com/smart/project-product-documentation/data-partitioning-in-treasure-data)). If the option is not provided, the data connector chooses the first *long* or *timestamp* column as the partitioning time. The column specified by *--time-column* must be of either *long* or *timestamp* type.

If your data doesn’t have a time column, you can add one by using the *add_time* filter option. For more details, see [add_time filter plugin](https://docs.treasuredata.com/smart/project-product-documentation/add_time-filter-function).


```
$ td connector:issue load.yml --database td_sample_db --table td_sample_table \
  --time-column created_at
```
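
To see why naming the time column explicitly is recommended, the default selection rule described above can be sketched in Python. The function `default_time_column` and the sample column list are hypothetical; the sketch only mirrors the stated rule (first *long* or *timestamp* column) and is not the connector's actual code.

```python
def default_time_column(columns):
    """Return the name of the first long or timestamp column, or None if there is none."""
    for column in columns:
        if column["type"] in ("long", "timestamp"):
            return column["name"]
    return None

# Hypothetical schema: an integer ID precedes the real timestamp column.
columns = [
    {"name": "id", "type": "long"},
    {"name": "company", "type": "string"},
    {"name": "customer", "type": "string"},
    {"name": "created_at", "type": "timestamp"},
]

chosen = default_time_column(columns)
print(chosen)
```

Under this rule the `id` column would be picked, because it is the first *long* column, even though `created_at` is the column that actually carries time. Passing `--time-column created_at` avoids that outcome.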

The connector:issue command assumes that you have already created a *database (td_sample_db)* and a *table (td_sample_table)*. If the database or the table does not exist in TD, the connector:issue command fails, so create the database and table [manually](https://docs.treasuredata.com/smart/project-product-documentation/data-management) or use the *--auto-create-table* option with the *td connector:issue* command to create them automatically:


```
$ td connector:issue load.yml --database td_sample_db --table td_sample_table --time-column created_at --auto-create-table
```

|  |
|  --- |
| At present, the data connector does not sort records on the server side. To use time-based partitioning effectively, sort records in files beforehand. |


If you have a field called *time*, you don’t have to specify the *--time-column* option.


```
$ td connector:issue load.yml --database td_sample_db --table td_sample_table
```

# Scheduled execution

You can schedule periodic Data Connector execution for incremental Microsoft Azure Blob Storage file imports. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a *cron* daemon in your local data center.

For the scheduled import, the Data Connector for Microsoft Azure Blob Storage imports all files that match with the specified prefix (e.g. path_prefix: `path/to/sample_` –> `path/to/sample_201501.csv.gz`, `path/to/sample_201502.csv.gz`, …, `path/to/sample_201505.csv.gz`) at first and remembers the last path (`path/to/sample_201505.csv.gz`) for the next execution.

On the second and subsequent runs, the connector imports only files that come after the last path in alphabetical (lexicographic) order. (`path/to/sample_201506.csv.gz`, …).
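
The last-path behavior described above can be sketched in a few lines of Python. The function `files_to_import` is hypothetical and the `sample_201412` file is invented for the example; the sketch assumes plain string comparison as the lexicographic order and is not the connector's actual implementation.

```python
def files_to_import(paths, last_path=None):
    """Return the paths that sort strictly after last_path, in lexicographic order."""
    ordered = sorted(paths)
    if last_path is None:
        return ordered  # first run: import everything that matched the prefix
    return [path for path in ordered if path > last_path]

# First run: five monthly files exist under the prefix.
first_listing = [f"path/to/sample_2015{month:02d}.csv.gz" for month in range(1, 6)]
first_batch = files_to_import(first_listing)
last_path = first_batch[-1]  # remembered for the next execution

# Second run: two files were added, one of which sorts before the last path.
second_listing = first_listing + [
    "path/to/sample_201412.csv.gz",
    "path/to/sample_201506.csv.gz",
]
second_batch = files_to_import(second_listing, last_path)

print(last_path)
print(second_batch)
```

Under this model, a file uploaded later whose name sorts before the remembered last path (here `sample_201412.csv.gz`) is never picked up, so file names should increase over time when you rely on this mode.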

## Create the schedule

A new schedule can be created using the *td connector:create* command. The following are required: the name of the schedule, the cron-style schedule, the database and table where the data will be stored, and the Data Connector configuration file.


```
$ td connector:create \
    daily_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml
```

It’s also recommended to specify the *--time-column* option, because Treasure Data’s storage is partitioned by time (see [data partitioning](https://docs.treasuredata.com/smart/project-product-documentation/data-partitioning-in-treasure-data)).


```
$ td connector:create \
    daily_import \
    "10 0 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --time-column created_at
```

|  |
|  --- |
| The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`. |



## List the Schedules

You can see the list of currently scheduled entries by running the command *td connector:list*.


```
$ td connector:list
+--------------+------------+----------+-------+--------------+-----------------+-------------------------------------------+
| Name         | Cron       | Timezone | Delay | Database     | Table           | Config                                    |
+--------------+------------+----------+-------+--------------+-----------------+-------------------------------------------+
| daily_import | 10 0 * * * | UTC      | 0     | td_sample_db | td_sample_table | {"in"=>{"type"=>"azure_blob_storage", ... |
+--------------+------------+----------+-------+--------------+-----------------+-------------------------------------------+
```

## Show the Settings and Schedule History

*td connector:show* shows the execution settings of a schedule entry.


```
$ td connector:show daily_import
Name     : daily_import
Cron     : 10 0 * * *
Timezone : UTC
Delay    : 0
Database : td_sample_db
Table    : td_sample_table
```

*td connector:history* shows the execution history of a schedule entry. To investigate the results of each individual run, use *td job jobid*.


```
$ td connector:history daily_import
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| JobID  | Status  | Records | Database     | Table           | Priority | Started                   | Duration |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
| 578066 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-18 00:10:05 +0000 | 160      |
| 577968 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-17 00:10:07 +0000 | 161      |
| 577914 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-16 00:10:03 +0000 | 152      |
| 577872 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-15 00:10:04 +0000 | 163      |
| 577810 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-14 00:10:04 +0000 | 164      |
| 577766 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-13 00:10:04 +0000 | 155      |
| 577710 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-12 00:10:05 +0000 | 156      |
| 577610 | success | 10000   | td_sample_db | td_sample_table | 0        | 2015-04-11 00:10:04 +0000 | 157      |
+--------+---------+---------+--------------+-----------------+----------+---------------------------+----------+
8 rows in set
```

## Delete the Schedule

*td connector:delete* removes the schedule.


```
$ td connector:delete daily_import
```

# Appendix

## A) Modes for out plugin

You can specify the file import mode in the *out* section of seed.yml.

### append (default)

This is the default mode and records are appended to the target table.


```
in:
  ...
out:
  mode: append
```

### replace (In td 0.11.10 and later)

This mode replaces data in the target table. Note that any manual schema changes made to the target table remain intact with this mode.


```
in:
  ...
out:
  mode: replace
```

## B) Proxy Setting


```
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  proxy:
    type: http
    host: 201.202.203.10
    port: 8080
    user: test
    password: test
```

## C) Incremental loading by last path


```
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  incremental: true
  use_modified_time: false
  last_path: logs/csv-123.csv
```

## D) Incremental loading by last modified time


```
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  incremental: true
  use_modified_time: true
  last_modified_time: 2025-04-09T00:00:00.000Z
```
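
To illustrate how selection by modified time differs from selection by path, here is a minimal Python sketch. The function names, the sample blob listing, and the strict "later than" comparison are all assumptions made for illustration; the connector's exact boundary handling is not documented here.

```python
from datetime import datetime, timezone

def parse_utc(text):
    """Parse a timestamp such as 2025-04-09T00:00:00.000Z as a UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def modified_since(blobs, last_modified_time):
    """Return the paths whose modified time is strictly later than last_modified_time."""
    # Assumption: a blob modified exactly at the threshold is not imported again.
    threshold = parse_utc(last_modified_time)
    return sorted(
        path for path, modified in blobs.items() if parse_utc(modified) > threshold
    )

# Hypothetical listing: path -> last modified time.
blobs = {
    "logs/csv-122.csv": "2025-04-08T23:59:59.000Z",
    "logs/csv-123.csv": "2025-04-09T00:00:00.000Z",
    "logs/csv-090.csv": "2025-04-10T12:30:00.000Z",
}

selected = modified_since(blobs, "2025-04-09T00:00:00.000Z")
print(selected)
```

In this sketch `logs/csv-090.csv` is selected even though its name sorts before `logs/csv-123.csv`, which is the practical difference from loading by last path.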