Embulk for Local Files

This article explains how to import files from your local machine to Treasure Data using Embulk's local file input plugin.

Prerequisites

  • Basic knowledge of Treasure Data.
  • Basic knowledge of Embulk (the local file input is a built-in plugin).
  • Embulk and the embulk-output-td plugin installed on your machine (an install command is shown below).
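
If the embulk-output-td plugin is not installed yet, it can be added with Embulk's built-in gem command (this assumes Embulk itself is already installed and on your PATH):

$ embulk gem install embulk-output-td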

Step 1: Create Seed Configuration File

Using your favorite text editor, create an Embulk configuration file (for example, seed.yml) defining the input (file) and output (TD) parameters. Example:

in:
    type: file
    path_prefix: /path/to/files/sample_    # path of *.csv or *.tsv file on your local machine
out:
    type: td
    apikey: xxxxxxxxxxxx
    endpoint: api.treasuredata.com
    database: dbname
    table: tblname
    time_column: time
    mode: replace
    #mode: append is the default if not specified; imported records are appended to the target table
    #mode: replace replaces the existing target table
    default_timestamp_format: '%Y-%m-%d %H:%M:%S'

Here is the sample data:

    id,account,time,purchase,comment
    1,32864,2015-01-27 19:23:49,20150127,embulk
    2,14824,2015-01-27 19:01:23,20150127,embulk jruby
    3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
    4,11270,2015-01-29 11:54:36,20150129,NULL

For further details about additional parameters available for the local file input plugin, refer to Embulk Local file input. For details about embulk-output-td, refer to TD output plugin for Embulk.
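
For example, if your local files were gzip-compressed (a hypothetical variation on the sample above), the input section of seed.yml could declare a decoder so that Embulk decompresses the files before parsing:

in:
    type: file
    path_prefix: /path/to/files/sample_
    decoders:
    - {type: gzip}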

Step 2: Guess Fields (Generate load.yml)

The embulk guess command uses seed.yml to read the target file, automatically guesses the column types and settings, and creates a new file, load.yml, with this information.

$ embulk guess seed.yml -o load.yml

The generated load.yml file:

    :::yaml
    in:
      type: file
      path_prefix: /path/to/files/sample_
      last_path: /path/to/files/sample_02.csv
      parser:
        charset: UTF-8
        newline: CRLF
        type: csv
        delimiter: ','
        quote: '"'
        escape: '"'
        null_string: 'NULL'
        trim_if_not_quoted: false
        skip_header_lines: 1
        allow_extra_columns: false
        allow_optional_columns: false
        columns:
        - {name: id, type: long}
        - {name: account, type: long}
        - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        - {name: purchase, type: timestamp, format: '%Y%m%d'}
        - {name: comment, type: string}
    out: {type: td, apikey: xxxxx, endpoint: api.treasuredata.com,
      database: dbname, table: tblname, time_column: time, mode: replace, default_timestamp_format: '%Y-%m-%d %H:%M:%S'}

Now you can preview the data using the embulk preview load.yml command. If any of the column types or values look incorrect, edit load.yml directly and preview again to verify. If the guess option does not yield satisfactory results, adjust the parameters in load.yml manually using the CSV/TSV parser plugin options.

  ::::terminal
  $ embulk preview load.yml

  +---------+--------------+-------------------------+-------------------------+----------------------------+
  | id:long | account:long |          time:timestamp |      purchase:timestamp |             comment:string |
  +---------+--------------+-------------------------+-------------------------+----------------------------+
  |       1 |       32,864 | 2015-01-27 19:23:49 UTC | 2015-01-27 00:00:00 UTC |                     embulk |
  |       2 |       14,824 | 2015-01-27 19:01:23 UTC | 2015-01-27 00:00:00 UTC |               embulk jruby |
  |       3 |       27,559 | 2015-01-28 02:20:02 UTC | 2015-01-28 00:00:00 UTC | Embulk "csv" parser plugin |
  |       4 |       11,270 | 2015-01-29 11:54:36 UTC | 2015-01-29 00:00:00 UTC |                            |
  +---------+--------------+-------------------------+-------------------------+----------------------------+
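
As a hypothetical example of a manual override: if you preferred to load the purchase column as a plain string rather than as a timestamp, you could edit its entry in the columns list of load.yml and run the preview again:

    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: string}    # changed from timestamp, for illustration only
    - {name: comment, type: string}
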
You will need to create the database and table in Treasure Data before executing the load job.

To do this:

$ td database:create dbname
$ td table:create dbname tblname

Alternatively, you can create the database and table via the Treasure Data Console.

Step 3: Execute Load Job

Finally, issue the import job by running the following command:

$ embulk run load.yml

It may take a few minutes to several hours for the job to complete, depending on the size of the data.
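
Once the run finishes, you can optionally verify the import with the td CLI (an illustrative check rather than part of the load itself; substitute your own database and table names):

$ td query -w -d dbname "SELECT COUNT(1) FROM tblname"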

