Bulk Import from TSV file
This article explains how to import data from TSV files to Treasure Data.
Install Bulk Loader
First, install the toolbelt, which includes the bulk loader program, on your computer.
After the installation, the td command is available on your computer. Open a terminal and type td to run it. Also make sure that java is installed. Then run td import:jar_update to download the latest version of the bulk loader:
    $ td
    usage: td [options] COMMAND [args]
    $ java
    Usage: java [-options] class [args...]
    $ td import:jar_update
    Installed td-import.jar 0.x.xx into /path/to/.td/java
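The prerequisite check above can also be scripted. The following is a minimal sketch, not part of the toolbelt: have_cmd and report_tools are hypothetical helper names, and the script only reports whether td and java resolve on the PATH without running either one.

```shell
#!/bin/sh
# have_cmd NAME: succeed if NAME resolves to a command on the PATH.
# (Hypothetical helper, not part of the td toolbelt.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# report_tools: print one status line for each tool this article needs.
# It only reports; it never aborts the script.
report_tools() {
    for tool in td java; do
        if have_cmd "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

report_tools
```

If either tool is reported as missing, install it before running td import:jar_update.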
Importing data from a TSV file
Suppose you have a file called data.tsv and its content looks like this:
    $ head -n 2 data.tsv
    host log_name date_time method url res_code bytes referer user_agent
    18.104.22.168 - 2004-03-07 16:05:49 GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables" 401 12846
Execute the following commands to upload the TSV file.
    $ td db:create my_db
    $ td table:create my_db my_tbl
    $ td import:auto \
      --auto-create my_db.my_tbl \
      --format tsv --column-header \
      --time-column date_time \
      --time-format "%Y-%m-%d %H:%M:%S" \
      ./data.tsv
Note: Because td import:auto runs MapReduce jobs to check for invalid rows, it takes at least 1-2 minutes.
In the above command, we assumed that:
- The data file is called data.tsv and is located in the current directory (hence ./data.tsv)
- The first line in the file contains the column names, hence the --column-header option. If the first row does not contain the column names, specify them with the --columns option (and optionally the column types with the --column-types option), or specify a type for each column in the file with --column-types.
- The time field is called “date_time” and it’s specified with the --time-column option
- The time format is %Y-%m-%d %H:%M:%S and it’s specified with the --time-format option
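Because each import run takes a minute or more, it can help to check these assumptions locally first. The following is a rough sketch under stated assumptions: check_time_column is a hypothetical helper (not a td command), the sample rows are made up, and the pattern only tests the shape YYYY-MM-DD HH:MM:SS. It does not reproduce the validation that the bulk loader itself performs.

```shell
#!/bin/sh
# Hypothetical helper: check_time_column FILE COLUMN
# Prints how many data rows have a COLUMN value that does not look like
# "YYYY-MM-DD HH:MM:SS". Exits with status 1 if the header row has no
# column named COLUMN.
check_time_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            # Find the position of the named column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { missing = 1; exit }
            next
        }
        $idx !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { bad++ }
        END {
            if (missing) exit 1
            print bad + 0
        }
    ' "$1"
}

# Made-up sample: one row in the expected format, one row that is not.
sample=$(mktemp)
printf 'host\tdate_time\tmethod\n' > "$sample"
printf '10.0.0.1\t2004-03-07 16:05:49\tGET\n' >> "$sample"
printf '10.0.0.2\t07/Mar/2004:16:06:51\tGET\n' >> "$sample"

check_time_column "$sample" date_time   # prints 1
```

A result of 0 means every row has the expected shape; any other value suggests that the --time-format string or the data needs another look before uploading.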
The following options are specific to bulk import from CSV/TSV files. They can be used to tailor the behavior of the parser to nonstandard CSV/TSV file formats:
    CSV/TSV specific options:
      --column-header     first line includes column names
      --delimiter CHAR    delimiter CHAR; default="," at csv, "\t" at tsv
      --newline TYPE      newline [CRLF, LF, CR]; default=CRLF
      --quote CHAR        quote [DOUBLE, SINGLE, NONE]; if csv format, default=DOUBLE. if tsv format, default=NONE
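To choose a value for --newline, it helps to know which line ending a file actually uses. The following is a small sketch, assuming a POSIX shell with head and grep: detect_newline is a hypothetical helper, the demo files are made up, and the sketch distinguishes only CRLF from LF (CR-only files are not detected).

```shell
#!/bin/sh
# Hypothetical helper: detect_newline FILE
# Prints CRLF if the first line ends with a carriage return, else LF.
# CR-only files are not detected by this sketch.
detect_newline() {
    cr=$(printf '\r')
    if head -n 1 "$1" | grep -q "${cr}\$"; then
        echo CRLF
    else
        echo LF
    fi
}

# Demo on two throwaway files with made-up content.
unix_file=$(mktemp)
dos_file=$(mktemp)
printf 'host\tbytes\n10.0.0.1\t12846\n' > "$unix_file"
printf 'host\tbytes\r\n10.0.0.1\t12846\r\n' > "$dos_file"

detect_newline "$unix_file"   # prints LF
detect_newline "$dos_file"    # prints CRLF
```

The printed value matches one of the TYPE values listed for --newline, so it could be passed to that option directly.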
Last modified: Aug 03 2015 00:01:48 UTC