Skip to content
Last updated

Embulk Bulk Import From Csv Files

You can import data from CSV files into Treasure Data using Embulk, an open-source bulk data loader. Embulk enables you to transfer data between various databases, storage locations, file formats, and cloud services.

You can also import data from CSV files using using the Bulk Import program (td-import). Be advised that the td-import is not actively maintained and is a candidate for deprecation. Therefore, we strongly recommend using Embulk.

Prerequisites

  • Basic knowledge of Treasure Data.
  • Basic Knowledge of Embulk.
  • Embulk is a Java application. Make sure that Java is installed.
  • Follow the instructions in Installing Bulk Data Import.

Create a Seed Configuration File

Using your favorite text editor, create an Embulk config file (for eg:seed.yml) defining the input file and output Treasure Data parameters.

Example

in:
  type: file
  path_prefix: /path/to/files/sample_
out:
  type: td
  apikey: xxxxxxxxxxxx
  endpoint: api.treasuredata.com
  database: dbname
  table: tblname
  time_column: time
  mode: replace
  default_timestamp_format: '%Y-%m-%d %H:%M:%S'

This is Sample Data.

id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL

For further details about additional parameters available for embulk-local-file-input, refer to Embulk Local file input Also details about embulk-output-td, refer to the TD output plugin for Embulk.

Guess Fields (Generate load.yml)

Embulk guess option uses seed.yml to read the target file and automatically guesses the column types and settings, and creates a new file load.yml with this information.

embulk guess seed.yml -o load.yml

Generated load.yml file.

in:
  type: file
  path_prefix: /path/to/files/sample_
  'last_path:': /path/to/files/sample_02.csv
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: td, apikey: xxxxx, endpoint: api.treasuredata.com, database: dbname, table: tblname, time_column: time, mode: replace, default_timestamp_format: '%Y-%m-%d %H:%M:%S'}

Best Practice: Add the "auto_create_table: true" parameter to the load.yml, so that tables that do not exist are automatically.

This is a sample of the auto_create_table parameter in a .yml file.

out:
  type: td
  apikey: your apikey
  endpoint: api.treasuredata.com
  database: dbname
  table: tblname
  time_column: created_at
  auto_create_table: true
  mode: append

You must create the database and table in TD, prior to executing the load job.

  • Alternative: If you:

    • must add a database
    • do not add the auto_create_table parameter in a .yml file and must add a table

run the following TD commands:

td database:create dbname
td table:create dbname tblname

You can also create the database and table using TD Console.

You can preview the data using embulk preview load.yml command. If any of the column types or data seems incorrect you can edit load.yml file directly and preview again to verify. If the guess option doesn’t yield satisfactory results, you can change parameters in load.yml according to your requirements manually by using CSV/TSV parser plugin options.

embulk preview load.yml

Execute Load Job

Issue the import job by running the following command:

embulk run load.yml

It may take a few minutes to several hours for the job to complete, depending on the size of the data.