Bulk Import from CSV file

This article explains how to import data from CSV files to Treasure Data.

Table of Contents

Install Bulk Loader

First, please install the toolbelt, which includes bulk loader program, on your computer.

Downloads

After the installation, the td command will be installed on your computer. Open up the terminal, and type td to execute the command. Also, please make sure you have java as well. Execute td import:jar_update to download the up-to-date version of our bulk loader:

$ td
usage: td [options] COMMAND [args]
$ java
Usage: java [-options] class [args...]
$ td import:jar_update
Installed td-import.jar 0.x.xx into /path/to/.td/java

Authenticate

Please login to your Treasure Data account.

$ td account -f
Enter your Treasure Data credentials.
Email: xxxxx
Password (typing will be hidden): 
Authenticated successfully.
Use 'td db:create <db_name>' to create a database.

Importing data from a CSV

Suppose you have a file called data.csv and its contents look like this:

$ head -n 5 data.csv
"host","log_name","date_time","method","url","res_code","bytes","referer","user_agent"
"64.242.88.10","-","2004-03-07 16:05:49","GET","/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables",401,12846,"",""
"64.242.88.10","-","2004-03-07 16:06:51","GET","/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2",200,4523,"",""
"64.242.88.10","-","2004-03-07 16:10:02","GET","/mailman/listinfo/hsdivision",200,6291,"",""
"64.242.88.10","-","2004-03-07 16:11:58","GET","/twiki/bin/view/TWiki/WikiSyntax",200,7352,"",""

Then, execute the following commands to upload the CSV file.

$ td db:create my_db
$ td table:create my_db my_tbl
$ td import:auto \
  --format csv --column-header \
  --time-column date_time \
  --time-format "%Y-%m-%d %H:%M:%S" \
  --auto-create my_db.my_tbl \
  ./data.csv
Untitled-3
Because `td import:auto` executes MapReduce jobs to check the invalid rows, it'll take at least 1-2 minutes.
Untitled-3
If the column chosen for `--time-column` is in epoch timestamp (or unix time), you don't need the `--time-format` flag.

In the above command, we assumed that:

  • The data file is called data.csv and is located in the current directory (hence ./data.csv)
  • The first line in the file indicates the column names, hence we specify the --column-header option. If the file does not have the column names in the first row, you will have to specify the column names with the --columns option (and optionally the column types with --column-types option), or use the --column-types for each column in the file.
  • The time field is called “date_time” and it’s specified with the --time-column option
  • The time format is %Y-%m-%d %H:%M:%S and it’s specified with the --time-format option

If you don’t have a header on the first line, you need specify the column names manually by --columns option.

$ td import:auto \
  --format csv --columns host,log_name,date_time,method,url,res_code,bytes,referer,user_agent \
  --time-column date_time \
  --time-format "%Y-%m-%d %H:%M:%S" \
  --auto-create my_db.my_tbl \
  ./data.csv

These options are specific to bulk import from CSV/TSV files and they can be used to tailor the behavior of the parser to non standard CSV/TSV file formats:

CSV/TSV specific options:
--column-header                  first line includes column names
--delimiter CHAR                 delimiter CHAR; default="," at csv, "\t" at tsv
--newline TYPE                   newline [CRLF, LF, CR];  default=CRLF
--quote CHAR                     quote [DOUBLE, SINGLE, NONE];
                                 if csv format, default=DOUBLE. if tsv format, default=NONE

For further details, check the following pages:


Last modified: Feb 24 2017 09:27:52 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.