You can import data from TSV files into Treasure Data using Embulk, an open-source bulk data loader.


Continue to the following topics:


Prerequisites

  • Basic knowledge of Treasure Data.

  • Basic Knowledge of Embulk.


What is Embulk?

Embulk enables you to transfer data between various databases, storage locations, file formats, and cloud services.


How to Install Embulk


Linux, Mac and BSD

Embulk is a Java application. Make sure that Java is installed.

Following 4 commands install Embulk to your home directory:

curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc


Windows

Embulk is a Java application. Make sure that Java is installed.

You can download embulk.bat using the following command on cmd.exe or PowerShell.exe:

PowerShell -Command \
  "& {Invoke-WebRequest http://dl.embulk.org/embulk-latest.jar -OutFile embulk.bat}"


How to install Treasure Data Plugin

You can use plugins to load data from and to various systems and file formats. Select to view a list of publicly released plugins: list of plugins by category.

The following command installs embulk-output-td plugin, which imports records to Treasure Data.

embulk gem install embulk-output-td


Create a Seed Configuration File

Using your favorite text editor, create embulk config file (for eg:seed.yml) defining input(file) and ouput(TD) parameters. Example:

in:
    type: file
    path_prefix: /path/to/files/sample_    # path of *.csv or *.tsv file on your local machine
out:
    type: td
    apikey: xxxxxxxxxxxx
    endpoint: api.treasuredata.com
    database: dbname
    table: tblname
    time_column: time
    mode: replace
    #by default mode: append is used, if not defined. Imported records are appended to the target table with this mode.
    #mode: replace, replaces existing target table
    default_timestamp_format: '%Y-%m-%d %H:%M:%S'

This is Sample Data.

    id,account,time,purchase,comment
    1,32864,2015-01-27 19:23:49,20150127,embulk
    2,14824,2015-01-27 19:01:23,20150127,embulk jruby
    3,27559,2015-01-28 02:20:02,20150128,"Embulk ""TSV"" parser plugin"
    4,11270,2015-01-29 11:54:36,20150129,NULL

For further details about additional parameters available for embulk-local-file-input, refer to Embulk Local file input Also details about embulk-output-td, refer to the TD output plugin for Embulk.


Guess Fields (Generate load.yml)

Embulk guess option uses seed.yml to read the target file and automatically guesses the column types and settings, and creates a new file load.yml with this information.

$ embulk guess seed.yml -o load.yml

Generated load.yml file.

    :::yaml
    in:
    type: file
    path_prefix: /path/to/files/sample_
    'last_path:': /path/to/files/sample_02.tsv
    parser:
      charset: UTF-8
      newline: CRLF
      type: tsv
      delimiter: ','
      quote: '"'
      escape: '"'
      null_string: 'NULL'
      trim_if_not_quoted: false
      skip_header_lines: 1
      allow_extra_columns: false
      allow_optional_columns: false
      columns:
      - {name: id, type: long}
      - {name: account, type: long}
      - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
      - {name: purchase, type: timestamp, format: '%Y%m%d'}
      - {name: comment, type: string}
    out: {type: td, apikey: xxxxx, endpoint: api.treasuredata.com,
      database: dbname, table: tblname, time_column: time, mode: replace, default_timestamp_format: '%Y-%m-%d %H:%M:%S'}

Best Practice: Add the "auto_create_table: true" parameter to the load.yml, so that tables that do not exist are automatically.

This is a sample of the auto_create_table parameter in a .yml file.

out:
  type: td
  apikey: <your apikey>
  endpoint: api.treasuredata.com
  database: dbname
  table: tblname
  time_column: created_at
  auto_create_table: true
  mode: append

You must create the database and table in TD, prior to executing the load job. Alternative: If you either: 1) must add a database or 2) do not add the auto_create_table parameter in a .yml file and must add a table, run the following TD commands:

$ td database:create dbname
$ td table:create dbname tblname

You can also create the database and table using TD Console.

Now, you may preview the data using embulk preview load.yml command. If any of the column types or data seems incorrect you can edit load.yml file directly and preview again to verify. If the guess option doesn’t yield satisfactory results, you can change parameters in load.yml according to your requirements manually by using CSV/TSV parser plugin options.

  ::::terminal
  $ embulk preview load.yml

  +---------+--------------+-------------------------+-------------------------+----------------------------+
  | id:long | account:long |          time:timestamp |      purchase:timestamp |             comment:string |
  +---------+--------------+-------------------------+-------------------------+----------------------------+
  |       1 |       32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC |                     embulk |
  |       2 |       14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC |               embulk jruby |
  |       3 |       27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "TSV" parser plugin |
  |       4 |       11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC |                            |
  +---------+--------------+-------------------------+-------------------------+----------------------------+


Execute Load Job

Finally, issue the import job by running the following command:

$ embulk run load.yml

It may take a few minutes to hours for the job to complete, depending on the size of the data.


Appendix

You can also import data from TSV files using the Bulk Import program (td-import). However, be advised that the td-import is not actively maintained and is a candidate for deprecation in the future. Therefore, we strongly recommend using Embulk.

  • No labels