The Data Connector for Microsoft Azure Blob Storage enables you to import the contents of .tsv and .csv files stored in your Azure Blob Storage container.
Connector UI limitations: Editing with the Connector UI has many limitations. We suggest using the CLI for your edits.
Basic knowledge of Treasure Data
A Microsoft Azure Platform account
Configure the Connection
You can submit a Data Connector job for Microsoft Azure Blob Storage from the Connector UI.
Create a new Microsoft Azure Blob Storage connector
First, you must register the connector by setting the following parameters:
Storage Account name: The name of your Microsoft Azure Blob Storage account.
Primary access key: The access key used to access your Microsoft Azure Blob Storage account.
With the proxy setting enabled
Transfer data from Microsoft Azure Storage
Next, create a “New Transfer” on the My Connections page. You can prepare an ad hoc Data Connector job or a scheduled Data Connector job. Complete the following steps.
Register the information that you want to ingest.
Container: Azure cloud storage container name (Ex. your_cont)
Path Prefix: prefix of target keys. (Ex. logs/data_)
Path Regex: regular expression to match file paths. A file whose path doesn’t match this pattern is skipped. (Ex. .csv$ keeps only the files whose paths match this pattern)
Amazon CloudFront is a web service that speeds up the distribution of your static and dynamic web content. You can configure CloudFront to create log files that contain detailed information about every user request that CloudFront receives. If you enable logging, you can save CloudFront log files, shown as follows:
In this case, the “Fetch from” settings should be as follows:
Path Prefix: logging/
Path Regex: .gz$ (Not Required)
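The way Path Prefix and Path Regex combine can be sketched in Python. This is a minimal illustration, not the connector's implementation: the function name `select_paths`, the sample blob names, and the use of an unanchored regex search are assumptions made for the example.

```python
import re

def select_paths(paths, path_prefix, path_regex=None):
    """Keep paths that start with path_prefix and, if given, match path_regex."""
    pattern = re.compile(path_regex) if path_regex else None
    selected = []
    for path in paths:
        if not path.startswith(path_prefix):
            continue  # outside the configured prefix
        if pattern is not None and not pattern.search(path):
            continue  # skipped: the path does not match the regex
        selected.append(path)
    return selected

# Hypothetical blob names, for illustration only.
blobs = ["logging/a.gz", "logging/b.txt", "other/c.gz"]
matched = select_paths(blobs, "logging/", r".gz$")
print(matched)
```

Leaving `path_regex` unset keeps every file under the prefix, which mirrors the "(Not Required)" note on Path Regex.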
You can see a preview of the data you configured. If you are unable to see the preview or have any issues viewing it, contact support.
The preview command downloads one file from the specified container and displays the results from that file only, so the results of the preview and issue commands may differ.
If you want to set a specified column name, select Advanced Settings.
Advanced Settings allows you to edit guessed properties. Edit the following section, if you need to.
Default timezone: Changes the time zone of timestamp columns if the value itself doesn’t include a time zone.
Name: Changes the name of the column. Column names may contain only lowercase letters, numbers, and “_” (underscore).
Type: Parses a value as a specified type and stores the type as part of the Treasure Data schema.
timestamp: imported as String type at Treasure Data (Ex. 2017-04-01 00:00:00.000)
Total file count limit: maximum number of files to read. (optional)
In this phase, select the target database and table that you want to import to. You can create a new database or table using the Create new database or Create new table checkboxes.
Mode: Append – Allows you to add records into the existing table.
Mode: Replace – Replaces the existing data in the table with the data being imported.
Partition key Seed: Choose the long or timestamp column that you would like to use as the partitioning time column. If you do not specify a time column, the upload time of the transfer is used in conjunction with the addition of an add_time filter.
In this phase, you can set an ad hoc or schedule configuration for your job.
Once now: Run the transfer only once.
Schedule: accepts these three options:
Delay Transfer: adds a delay to the execution time.
Data Storage Timezone: Timezone the data is stored in; data will also be displayed in this timezone. Supports extended timezone formats like ‘Asia/Tokyo’.
After selecting the frequency, select Start Transfer to begin the transfer. If there are no errors, the transfer into Treasure Data will complete and the data will be available. Jobs are kicked off when a transfer runs. You can use the Jobs or the My Input Transfers section to monitor the progress of your data transfer.
Troubleshoot Data Import
Review the job log. Warnings and errors provide information about the success of your import. For example, you can identify the source file names associated with import errors.
Use the CLI to Configure the Connector
You can also use the Microsoft Azure Blob Storage data connector from the command line interface. The following instructions show you how to import data using the CLI.
Install ‘td’ command v0.11.9 or later
Install the newest Treasure Data Toolbelt.
Create Seed Config File (seed.yml)
First, prepare seed.yml as shown in the following example, with your account information (see About Azure storage accounts). You must also specify the container name and the target file name (or a prefix for multiple files).
The Data Connector for Microsoft Azure Blob Storage imports all files that match a specified prefix. (e.g. path_prefix:
For more details on available out modes, review the Appendix below.
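As a rough sketch of what a seed.yml might contain, the Python snippet below renders one from the account name, access key, container, and path prefix described above. The key names (`type: azure_blob_storage`, `account_name`, `account_key`, `container`, `path_prefix`) are assumptions modeled on the Embulk Azure Blob Storage input plugin and are not confirmed by this page. The helper `render_seed_yml` and the sample values are hypothetical.

```python
def render_seed_yml(account_name, account_key, container, path_prefix, mode="append"):
    """Return seed.yml text. The key names are assumed, not taken from this page."""
    lines = [
        "in:",
        "  type: azure_blob_storage",
        f"  account_name: {account_name}",
        f"  account_key: {account_key}",
        f"  container: {container}",
        f"  path_prefix: {path_prefix}",
        "out:",
        f"  mode: {mode}",
    ]
    return "\n".join(lines) + "\n"

# Placeholder values; substitute your own account details.
seed_text = render_seed_yml("myaccount", "my_access_key", "your_cont", "logs/data_")
print(seed_text)
```

The sketch does no YAML quoting, so values containing characters that are special in YAML would need to be quoted by hand before the text is saved as seed.yml.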
Guess Fields (Generate load.yml)
Second, use connector:guess. This command automatically reads the source file, and assesses (uses logic to guess) the file format.
If you open up load.yml, you’ll see the guessed file format definitions including file formats, encodings, column names, and types.
Then, you can preview how the system will parse the file by using the preview command.
The guess command needs more than 3 rows and 2 columns in the source data file, because it guesses the column definitions from sample rows in the source data.
If the system guesses a column name or column type incorrectly, modify load.yml directly and preview again.
Currently, the data connector supports parsing of “boolean”, “long”, “double”, “string”, and “timestamp” types.
You also must create a local database and table prior to executing the data load job. Follow these steps:
Execute Load Job
Finally, submit the load job. It may take a couple of hours depending on the size of the data. Specify the Treasure Data database and table where the data should be stored.
It’s also recommended to specify the --time-column option, because Treasure Data’s storage is partitioned by time (see data partitioning). If the option is not provided, the data connector chooses the first long or timestamp column as the partitioning time. The column specified by --time-column must be of either long or timestamp type.
If your data doesn’t have a time column, you can add one by using the add_time filter option. For more details, see the add_time filter plugin.
The connector:issue command assumes that you have already created a database (td_sample_db) and a table (td_sample_table). If the database or the table does not exist in TD, the connector:issue command fails. Either create the database and table manually, or use the --auto-create-table option with the td connector:issue command to create them automatically:
At present, the data connector does not sort records on the server side. To use time-based partitioning effectively, sort the records in your files beforehand.
If you have a field called time, you don’t have to specify the --time-column option.
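The rules for picking the partitioning time column can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the connector's actual logic: the function `choose_time_column`, the `(name, type)` column representation, and the precedence of an explicit `--time-column` value over a field called time are all hypothetical.

```python
def choose_time_column(columns, time_column=None):
    """Pick the partitioning column from a list of (name, type) pairs in file order."""
    types = dict(columns)
    if time_column is not None:
        # An explicitly requested column must be long or timestamp.
        if types.get(time_column) not in ("long", "timestamp"):
            raise ValueError(f"{time_column!r} must be a long or timestamp column")
        return time_column
    if "time" in types:
        return "time"  # a field called time needs no --time-column option
    for name, col_type in columns:
        if col_type in ("long", "timestamp"):
            return name  # first long or timestamp column
    return None  # no candidate: consider the add_time filter

# Hypothetical schema, for illustration only.
schema = [("id", "long"), ("name", "string"), ("created_at", "timestamp")]
print(choose_time_column(schema))
print(choose_time_column(schema, "created_at"))
```

With this sample schema the fallback rule selects `id`, a long column that is not a time value, which is one reason specifying the column explicitly is recommended.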
You can schedule periodic Data Connector execution for incremental Microsoft Azure Blob Storage file imports. We configure our scheduler carefully to ensure high availability. By using this feature, you no longer need a cron daemon in your local data center.
For the scheduled import, the Data Connector for Microsoft Azure Blob Storage first imports all files that match the specified prefix (e.g. path_prefix: path/to/sample_201505.csv.gz) and remembers the last path (path/to/sample_201505.csv.gz) for the next execution.
On the second and subsequent runs, the connector imports only files that come after the last path in alphabetical (lexicographic) order.
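The incremental behavior described here can be sketched as follows. This is a simplified model, not the connector's code: the function `files_for_run` and the sample file names other than path/to/sample_201505.csv.gz are made up for the example.

```python
def files_for_run(paths, path_prefix, last_path=None):
    """Return (files to import, new last path) for one scheduled run."""
    matching = sorted(p for p in paths if p.startswith(path_prefix))
    if last_path is not None:
        # Python compares strings lexicographically, like the ordering described above.
        matching = [p for p in matching if p > last_path]
    new_last = matching[-1] if matching else last_path
    return matching, new_last

# First run: everything under the prefix is imported.
container = ["path/to/sample_201504.csv.gz", "path/to/sample_201505.csv.gz"]
first, last = files_for_run(container, "path/to/sample_")

# Second run: only the file that sorts after the remembered path is imported.
container.append("path/to/sample_201506.csv.gz")
second, last_after_second = files_for_run(container, "path/to/sample_", last)
print(first, second)
```

A consequence of this ordering is that a new file whose name sorts before the remembered last path would not be picked up by later runs.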
Create the schedule
A new schedule can be created using the td connector:create command. The following are required: the name of the schedule, the cron-style schedule, the database and table where the data will be stored, and the Data Connector configuration file.
It’s also recommended to specify the --time-column option, because Treasure Data’s storage is partitioned by time (see data partitioning).
The `cron` parameter also accepts three special options: `@hourly`, `@daily` and `@monthly`.
By default, the schedule is set up in the UTC timezone. You can set the schedule in a different timezone using the -t or --timezone option. Note that the `--timezone` option supports only extended timezone formats like 'Asia/Tokyo' and 'America/Los_Angeles'. Timezone abbreviations like PST and CST are *not* supported and may lead to unexpected schedules.
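The following Python sketch assembles a td connector:create command line from the required values and rejects timezone abbreviations before anything is run. The positional order of the arguments is assumed to follow the order in which they are listed above. The helper `build_create_command`, the schedule name `daily_import`, the column `created_at`, and the Region/City check are illustrative assumptions; the check only tests the shape of the name and does not confirm that the timezone exists.

```python
import re
import shlex

# Heuristic for extended names such as Asia/Tokyo; it does not validate the zone.
_REGION_CITY = re.compile(r"^[A-Za-z_]+(?:/[A-Za-z0-9_+\-]+)+$")

def build_create_command(name, cron, database, table, config,
                         time_column=None, timezone=None):
    """Build the argument list for td connector:create without running it."""
    if timezone is not None and not _REGION_CITY.match(timezone):
        raise ValueError(f"use a Region/City timezone name, got {timezone!r}")
    argv = ["td", "connector:create", name, cron, database, table, config]
    if time_column:
        argv += ["--time-column", time_column]
    if timezone:
        argv += ["--timezone", timezone]
    return argv

argv = build_create_command(
    "daily_import", "@daily", "td_sample_db", "td_sample_table", "load.yml",
    time_column="created_at", timezone="Asia/Tokyo",
)
print(shlex.join(argv))
```

Building the argument list separately from executing it makes it easy to review the command before passing it to a shell or to `subprocess.run`.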
List the Schedules
You can see the list of currently scheduled entries by running the command td connector:list.
Show the Settings and Schedule History
td connector:show shows the execution settings of a schedule entry.
td connector:history shows the execution history of a schedule entry. To investigate the results of each individual run, use td job <jobid>.
Delete the Schedule
td connector:delete removes the schedule.
A) Modes for out plugin
You can specify the file import mode in the out section of seed.yml.
append (default)
This is the default mode; records are appended to the target table.
replace (In td 0.11.10 and later)
This mode replaces data in the target table. Note that any manual schema changes made to the target table remain intact with this mode.
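The difference between the two modes can be illustrated with a toy Python model in which a table is simply a list of records. The function `apply_mode` and the sample records are hypothetical and say nothing about how Treasure Data stores data.

```python
def apply_mode(existing_rows, imported_rows, mode="append"):
    """Return the table contents after an import in the given mode."""
    if mode == "append":
        # Existing records are kept and the imported ones are added.
        return list(existing_rows) + list(imported_rows)
    if mode == "replace":
        # Existing records are discarded in favor of the imported ones.
        return list(imported_rows)
    raise ValueError(f"unknown mode: {mode!r}")

existing = [{"id": 1}, {"id": 2}]
imported = [{"id": 3}]
print(apply_mode(existing, imported))
print(apply_mode(existing, imported, mode="replace"))
```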