Bulk Import Tips and Tricks

This article describes some tips and tricks for bulk import.

How to use a proxy server

If uploads fail, first check whether your network requires a proxy. If it does, set the proxy with the HTTP_PROXY environment variable:

# Windows:
$ set HTTP_PROXY=http://proxy_host:8080
# Other:
$ export HTTP_PROXY="proxy_host:8080"
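
Before retrying an upload, it can help to confirm what the current shell will actually pass on. The following is a minimal sketch for non-Windows shells; proxy_status is a hypothetical helper name, and proxy_host:8080 is the placeholder used above, not a real address.

```shell
# proxy_status: hypothetical helper that reports whether HTTP_PROXY is set.
proxy_status() {
  if [ -n "${HTTP_PROXY:-}" ]; then
    echo "HTTP_PROXY is set to: $HTTP_PROXY"
  else
    echo "HTTP_PROXY is not set"
  fi
}

# Check the current state of the shell.
proxy_status

# Set the proxy for the rest of this session and check again.
export HTTP_PROXY="proxy_host:8080"
proxy_status
```

Because the variable is exported, every command started later from the same shell inherits it.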

How to increase performance through parallelism

td import:auto supports two options to tune parallelism: --parallel and --prepare-parallel.

--parallel NUM                   upload in parallel (default: 2; max 8)
--prepare-parallel NUM           prepare in parallel (default: 2; max 96)

--parallel specifies how many threads are used to upload the data. If the bulk import tool is not saturating your network, increase the value of the --parallel option.

--prepare-parallel specifies the number of threads used to compress the data locally. Normally, this number should match the number of CPU cores on your machine.
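
The tuning advice above can be sketched as a small shell script. This is a sketch under assumptions, not a prescribed procedure: the starting value of 4 upload threads and the file name data.csv are placeholders, clamp is a hypothetical helper, and any other options a real import needs are omitted. The script prints the resulting command rather than running it.

```shell
# clamp VALUE MAX: hypothetical helper that limits VALUE to the range 1..MAX.
clamp() {
  clamp_value=$1
  clamp_max=$2
  if [ "$clamp_value" -lt 1 ]; then clamp_value=1; fi
  if [ "$clamp_value" -gt "$clamp_max" ]; then clamp_value=$clamp_max; fi
  echo "$clamp_value"
}

# Number of online CPU cores; fall back to 2 (the documented default) if unknown.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
case "$cores" in
  ''|*[!0-9]*) cores=2 ;;
esac

upload_threads=$(clamp 4 8)            # placeholder; raise it if the network is not saturated
prepare_threads=$(clamp "$cores" 96)   # match the CPU core count, up to the maximum of 96

# Dry run: print the command instead of executing it.
echo "td import:auto --parallel $upload_threads --prepare-parallel $prepare_threads data.csv"
```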

How to specify time column for maximum query performance

If your data has no time column, please don’t specify ‘0’ as the time. Treasure Data partitions the data by time by default (see Data Partitioning), so it’s recommended to always specify the time column or, if there is none, the current time.
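
For data without a time column, one way to follow this advice is to pass the current time explicitly. The sketch below assumes the --time-value option of td import is available in your td version (check the td help output to confirm); data.csv is a placeholder, and the command is printed rather than run.

```shell
# Current time as a Unix timestamp (seconds since the epoch).
now=$(date +%s)

# Dry run: print a command that assigns the current time to every record.
# --time-value is an assumption here; confirm it against your td version's help.
echo "td import:auto --time-value $now data.csv"
```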

How to enable or disable JAR auto-update

The option to enable or disable JAR auto-update is available in td v0.11.2 and later. It is controlled by the environment variable TD_TOOLBELT_JAR_UPDATE.

JAR auto-update is enabled by default, or when the variable is set to 1:

$ td import:prepare
$ TD_TOOLBELT_JAR_UPDATE=1 td import:prepare

JAR auto-update is disabled when the variable is set to 0:

$ TD_TOOLBELT_JAR_UPDATE=0 td import:prepare

This setting does not affect td import:jar_update, which always updates the JAR file.
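
The examples above set the variable for a single command. As a sketch of the difference between that form and a session-wide setting, the snippet below uses a child shell (sh -c) as a stand-in for td so that it runs without the toolbelt installed. This is standard shell behavior rather than anything specific to td.

```shell
# Per-command: the variable is visible only to this one child process.
TD_TOOLBELT_JAR_UPDATE=0 sh -c 'echo "per-command: ${TD_TOOLBELT_JAR_UPDATE:-unset}"'

# Without the prefix, the child sees no value unless one was exported earlier.
sh -c 'echo "next command: ${TD_TOOLBELT_JAR_UPDATE:-unset}"'

# Session-wide: every later command in this shell inherits the value.
export TD_TOOLBELT_JAR_UPDATE=0
sh -c 'echo "after export: ${TD_TOOLBELT_JAR_UPDATE:-unset}"'
```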

How to confirm the time zone

The bulk import tool uses the TZ environment variable. If you think your bulk import time zone is wrong, please check your TZ environment variable.
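
A quick way to check what the tool will inherit is to inspect TZ in the same shell that runs the import. The sketch below uses only echo and date; UTC0 is a POSIX-style TZ value chosen so the comparison works without a time zone database, and is not a recommendation for your import.

```shell
# Show the TZ value that commands started from this shell inherit.
echo "TZ=${TZ:-(unset, system default applies)}"

# Show the UTC offset the current setting resolves to.
date +%z

# Override TZ for a single command and compare the offset.
TZ=UTC0 date +%z
```

If the two offsets differ from what your data expects, adjust TZ before starting the import.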

How to encode Shift_JIS

When your data is encoded in Shift_JIS, set the encoding option to ‘-e Windows-31J’.

How to use --time-format

If you want to tell bulk import which time format your data source uses, specify it with --time-format, using the letters in the following table.

Letter  Date or Time Component                  Presentation        Examples
Y,G     Year with Century                       Year                1996; 2006
y,g     The last 2 digits of Year               Year                96; 06
m       Month in year                           Month               01..12
B,b     The full/abbreviated month name         Month               January; Jan
d,e     Day in month, zero/blank padded         Number              01..31; 1..31
V       Week number of the week-based Year      Number              01..53
j       Day in year                             Number              0-365
A,a     The full/abbreviated day name in week   Text                Tuesday; Tue
H,k     Hour in day                             Number              00-23; 0-23
I,l     Hour in day                             Number              00-11; 0-11
M       Minute in hour                          Number              00-59
S       Second in minute                        Number              00-59
L       Millisecond                             Number              000-999
P,p     AM/PM;am/pm marker                      Text                AM; PM; am; pm
Z,z     Time zone                               General time zone   GMT-08:00; -0800
c       Year to second                          Text                Tue Jan 1 14:00:00 2016
D,x     Year to day                             Text                01/01/16
F       Year to day                             Text                2016-01-01
T,X     Hour to second                          Text                14:00:00
r       Hour to second am/pm                    Text                02:00:00 pm
R       Hour to minute                          Text                14:00
n       Newline character                       LF                  \n
t       Tab character                           Tab                 \t
%       Literal % character                     %                   %
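
As an illustration of reading the table, suppose the data source stores timestamps such as 2016-01-01 14:00:00 (a made-up sample). The sketch below assumes the letters are written with a leading % in strftime style, as the literal % row suggests. The date command is used only to preview the format, data.csv is a placeholder, and the td command is printed rather than run.

```shell
# Spelled out letter by letter: year, month, day, hour, minute, second.
time_format="%Y-%m-%d %H:%M:%S"
# Shorter form built from the table: F is year to day, T is hour to second.
time_format_short="%F %T"

# Preview both formats with the current time; date accepts the same letters.
TZ=UTC0 date +"$time_format"
TZ=UTC0 date +"$time_format_short"

# Dry run: print the import command with the chosen format.
echo "td import:auto --time-format \"$time_format\" data.csv"
```

Previewing a format this way makes it easy to compare the output against a real line from the data source before starting a long import.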

Last modified: Apr 25 2016 04:39:36 UTC
