This article explains how to import data directly from Amazon S3 to Treasure Data.
First, install the Toolbelt, which includes the bulk loader program, on your computer.
After the installation, the td command will be available on your computer. Open a terminal and type td to execute the command. Also make sure Java is installed. Then run td import:jar_update to download the latest version of the bulk loader:
$ td
usage: td [options] COMMAND [args]
$ java
Usage: java [-options] class [args...]
$ td import:jar_update
Installed td-import.jar 0.x.xx into /path/to/.td/java
Log in to your Treasure Data account:
$ td account -f
Enter your Treasure Data credentials.
Email: xxxxx
Password (typing will be hidden):
Authenticated successfully.
Use 'td db:create db_name' to create a database.
The bulk loader can read data from files stored in Amazon S3 in any of the three supported file formats:
- CSV (default)
- JSON
- TSV
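The format is selected with the --format flag of the import command shown below. As a minimal sketch (all other options elided), only that flag changes between formats:
$ td import:auto --format csv ...
$ td import:auto --format tsv ...
$ td import:auto --format json ...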
Suppose you have a file called data.csv on Amazon S3 with these contents:
"host","log_name","date_time","method","url","res_code","bytes","referer","user_agent"
"64.242.88.10","-","2004-03-07 16:05:49","GET","/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables",401,12846,"",""
"64.242.88.10","-","2004-03-07 16:06:51","GET","/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2",200,4523,"",""
"64.242.88.10","-","2004-03-07 16:10:02","GET","/mailman/listinfo/hsdivision",200,6291,"",""
"64.242.88.10","-","2004-03-07 16:11:58","GET","/twiki/bin/view/TWiki/WikiSyntax",200,7352,"",""Execute the following commands to upload the CSV file:
$ td db:create my_db
$ td table:create my_db my_tbl
$ td import:auto \
--format csv --column-header \
--time-column date_time \
--time-format "%Y-%m-%d %H:%M:%S" \
--auto-create my_db.my_tbl \
"s3://s3_access_key:s3_secret_key@/my_bucket/path/to/data.csv"where the location of the file is expressed as an S3 path with the AWS public and private access keys embedded in it.
Because td import:auto executes MapReduce jobs to check for invalid rows, the import takes at least 1-2 minutes. If the column chosen for --time-column contains epoch timestamps (unix time), the --time-format flag is not needed (see the variant sketched after the list of assumptions below).
In the above command, we assumed that:
- The CSV files are located on Amazon S3, within a bucket called my_bucket under the path/key path/to/.
- The first line in the file contains the column names, hence the --column-header option. If the file does not have the column names in the first row, you will have to specify them with the --columns option, and optionally the type of each column with the --column-types option.
- The time field is called "date_time" and is specified with the --time-column option.
- The time format is %Y-%m-%d %H:%M:%S and is specified with the --time-format option.
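For reference, here is a minimal sketch of the same import when the time column already holds unix timestamps; the column name time is an assumed placeholder, and --time-format is omitted because it is not needed in that case:
$ td import:auto \
--format csv --column-header \
--time-column time \
--auto-create my_db.my_tbl \
"s3://s3_access_key:s3_secret_key@/my_bucket/path/to/data.csv"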
The source files to be imported by the bulk loader can be specified as full Amazon S3 paths or using wildcards. Here are some examples:
- s3://my_bucket/path/to/data*: all files under my_bucket/path/to/ with prefix data
- s3://my_bucket/path/to/data*.csv: all files under my_bucket/path/to/ with prefix data and extension .csv
- s3://my_bucket/path/to/*.csv: all files under my_bucket/path/to/ with extension .csv
- s3://my_bucket/path/to/*: all files under my_bucket/path/to/
- s3://my_bucket/path/to/*/*.csv: all files in the direct subfolders of my_bucket/path/to/ with extension .csv
- s3://my_bucket/**/*.csv: all files in all subfolders of my_bucket/ with extension .csv
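For example, the following sketch (bucket, path, and credentials are placeholders) imports every CSV file under a single prefix in one run:
$ td import:auto \
--format csv --column-header \
--time-column date_time \
--time-format "%Y-%m-%d %H:%M:%S" \
--auto-create my_db.my_tbl \
"s3://s3_access_key:s3_secret_key@/my_bucket/path/to/*.csv"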
For further details, check the following pages: