This article describes some tips and tricks for bulk import.



Solving Error: There Was a Problem Accessing the Remote XML Resource

This error can occur in TD Toolbelt v0.16.7 and earlier versions.

If td import:jar_update fails with the following error, use one of the solutions described after the error message.

Error message:

Error: There was a problem accessing the remote XML resource 
'http://central.maven.org/maven2/com/treasuredata/td-import/maven-metadata.xml' 
(TreasureData::Command::UpdateError: An error occurred when fetching 
from 'http://central.maven.org/maven2/com/treasuredata/td-import/maven-metadata.xml'.)

Solution 1: Update TD Toolbelt Version

The error is fixed in v0.16.8 and later versions.

Solution 2: Set an Environment Variable

Setting the TD_TOOLBELT_JARUPDATE_ROOT environment variable avoids the error:

$ export TD_TOOLBELT_JARUPDATE_ROOT=https://repo1.maven.org


Using a Proxy Server

If you cannot upload your data, check whether your network requires a proxy server. If it does, set the proxy with the HTTP_PROXY environment variable:

Operating System  Option 1                                 Option 2
Windows           $ set HTTP_PROXY=http://proxy_host:8080  $ set HTTP_PROXY=http://user:password@proxy_host:8080
Other             $ export HTTP_PROXY="proxy_host:8080"    $ export HTTP_PROXY="user:password@proxy_host:8080"


Increasing Performance through Parallelism

td import:auto supports two options to tune parallelism: --parallel and --prepare-parallel.

See the TD Toolbelt Command Reference for the full syntax reference.

$ td import:auto <session name> <files...>
--parallel NUM                    
--prepare-parallel NUM            
  • --parallel specifies the number of threads used to upload the data. If you observe that the bulk import tool is not saturating your network, increase the value of --parallel. The default is 2 and the maximum is 8.

  • --prepare-parallel specifies the number of threads used to compress the data locally. Normally, this number should match the number of CPU cores on your machine. The default is 2 and the maximum is 96.
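
As a rough sketch of how these two values might be chosen, the following shell snippet derives --prepare-parallel from the CPU core count and clamps both values to the limits stated above. The clamp helper, the variable names, the starting value of 4 upload threads, and the use of getconf to count cores are illustrative assumptions, not part of the td toolbelt. The session name my_session and the file data.csv are placeholders.

```shell
#!/bin/sh
# clamp VALUE MIN MAX: print VALUE limited to the range MIN..MAX (hypothetical helper).
clamp() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"
  elif [ "$1" -gt "$3" ]; then
    echo "$3"
  else
    echo "$1"
  fi
}

# Assumption: getconf reports the number of online CPU cores; fall back to 2 otherwise.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)

upload_threads=$(clamp 4 1 8)             # --parallel: maximum is 8
prepare_threads=$(clamp "$cores" 1 96)    # --prepare-parallel: maximum is 96

# Print the command instead of running it, so the values can be reviewed first.
echo "td import:auto my_session data.csv --parallel $upload_threads --prepare-parallel $prepare_threads"
```

Printing the command first makes it easy to confirm the thread counts before starting a long upload. Raise the upload thread count only if the network is still not saturated.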

Specifying a Time Column for Maximum Query Performance

If your data has no time column, do not specify ‘0’ as the time. Treasure Data partitions data by time by default (see Data Partitioning), so always specify the time column or, when there is none, specify the current time.

Enabling or Disabling Auto jar_update

The option to enable or disable auto jar_update is available in td v0.11.2 and later versions.

It is controlled by the TD_TOOLBELT_JAR_UPDATE environment variable.

JAR auto-update is enabled by default, or when the variable is set to 1:

$ td import:prepare
$ TD_TOOLBELT_JAR_UPDATE=1 td import:prepare

JAR auto-update is disabled when the variable is set to 0:

$ TD_TOOLBELT_JAR_UPDATE=0 td import:prepare

This setting does not affect td import:jar_update, which always updates the JAR file.

Confirming Time Zone

The bulk import tool uses the TZ environment variable. If the time zone applied by your bulk import seems wrong, check the value of TZ.
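
To see what the tool would inherit, you can print the current TZ value and compare how one instant renders under different settings. This is a minimal sketch that assumes GNU date (for the -d @SECONDS form). The POSIX-style zone strings UTC0 and JST-9 are only illustrative values.

```shell
#!/bin/sh
# Show the TZ value that a command started from this shell would inherit.
echo "TZ is currently: ${TZ:-(not set)}"

# Render the same instant (Unix epoch 0) under two TZ settings.
# Assumption: GNU date, which accepts -d @SECONDS.
utc_time=$(TZ=UTC0 date -d @0 '+%Y-%m-%d %H:%M:%S')
jst_time=$(TZ=JST-9 date -d @0 '+%Y-%m-%d %H:%M:%S')

echo "UTC0:  $utc_time"
echo "JST-9: $jst_time"
```

If the two printed times differ from what you expect for your data, set TZ explicitly in the shell that runs the import.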

Encoding Shift_JIS

When your data is encoded in Shift_JIS, set the encoding option to ‘-e Windows-31J’.

Using Time-Format

To tell bulk import which time format your data source uses, specify it with --time-format, using the letters in the following correspondence table.

Letter  Date or Time Component                     Presentation       Examples
Y,G     Year with Century                          Year               1996; 2006
y,g     The last 2 digits of Year                  Year               96; 06
m       Month in year                              Month              01..12
B,b     The full/abbreviated month name            Month              January; Jan
d,e     Day in a month, zero/blank padded          Number             01..31; 1..31
V       Week number of the week-based Year         Number             01..53
j       Day in year                                Number             0-365
A,a     The full/abbreviated day name in the week  Text               Tuesday; Tue
H,k     Hour in day                                Number             00-23; 0-23
I,l     Hour in day                                Number             00-11; 0-11
M       Minute in hour                             Number             00-59
S       Second in minute                           Number             00-59
L       Millisecond                                Number             000-999
P,p     AM/PM; am/pm marker                        Text               AM; PM; am; pm
Z,z     Time zone                                  General time zone  GMT-08:00; -0800
c       Year to second                             Text               Tue Jan 1 14:00:00 2016
D,x     Year to date                               Text               01/01/16
F       Year to date                               Text               2016-01-01
T,X     Hour to second                             Text               14:00:00
r       Hour to second am/pm                       Text               02:00:00 pm
R       Hour to minute                             Text               14:00
n       Newline character                          LF                 \n
t       Tab character                              Tab                \t
%       Literal % character                        %                  %
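
As an illustration, assuming the letters above are written with a leading % as in strftime (the literal % entry suggests this), a value such as 2016-01-01 14:00:00 would correspond to the format string %Y-%m-%d %H:%M:%S. The sketch below uses GNU date, not the td toolbelt, to preview that format string against a known instant. The fmt variable and the sample epoch value are illustrative, and date does not support every letter in the table.

```shell
#!/bin/sh
# Candidate format string for values like "2016-01-01 14:00:00".
fmt='%Y-%m-%d %H:%M:%S'

# 1451656800 is 2016-01-01 14:00:00 UTC.
# Assumption: GNU date, which accepts -d @SECONDS.
sample=$(TZ=UTC0 date -d @1451656800 "+$fmt")
echo "$sample"

# %F and %T from the table are shorthand for the date part and the time part.
short=$(TZ=UTC0 date -d @1451656800 '+%F %T')
echo "$short"
```

If the preview matches the values in your data, the same string can then be tried as the --time-format argument, quoted so the shell passes it through unchanged.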
