
Bulk Export

This article explains Treasure Data’s bulk-export feature, which lets you dump data into your Amazon S3 bucket.

At Treasure Data, we believe that your data belongs to you, even after it has been imported to our platform. We believe that vendor lock-in must be stopped.

(Export capability is currently limited to S3 buckets in the us-east region.) If you would like to remove this limitation, please contact our support team.


Prerequisites

  • Basic knowledge of Treasure Data, including the Treasure Data Toolbelt (a quick setup check is sketched below).
  • An AWS account and an Amazon S3 bucket.
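
If you have not set up the Toolbelt yet, the following is a minimal sketch of a setup check, assuming the Toolbelt is installed as a Ruby gem and you authenticate with your Treasure Data account credentials:

  # Install the Treasure Data Toolbelt and store your credentials.
  $ gem install td
  $ td account -f

  # Confirm the CLI can reach your account and list your databases.
  $ td db:list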

Table Dump

The td table:export command will dump all the data uploaded to TD into your Amazon S3 bucket. Please specify the database and table from which to dump your data.

$ td table:export database_name table_name \
   --s3-bucket <S3_BUCKET_NAME> \
   --prefix <S3_FILE_PREFIX> \
   --aws-key-id <AWS_KEY> \
   --aws-secret-key <AWS_SECRET_KEY> \
   --file-format jsonl.gz
We highly recommend using the jsonl.gz or tsv.gz format, because we have specific performance optimizations for these formats. Other formats are significantly slower.

The dump is performed via MapReduce jobs, where the location of the bucket is expressed as an S3 path with the AWS public and private access keys embedded in it.

usage:
  $ td table:export <db> <table>

example:
  $ td table:export example_db table1 --s3-bucket mybucket -k KEY_ID -s SECRET_KEY

description:
  Dump logs in a table to the specified storage

options:
  -w, --wait                       wait until the job is completed
  -f, --from TIME                  export data which is newer than or same with the TIME (unixtime e.g. 1446617523)
  -t, --to TIME                    export data which is older than the TIME (unixtime e.g. 1480383205)
  -b, --s3-bucket NAME             name of the destination S3 bucket (required)
  -p, --prefix PATH                path prefix of the file on S3
  -k, --aws-key-id KEY_ID          AWS access key id to export data (required)
  -s, --aws-secret-key SECRET_KEY  AWS secret access key to export data (required)
  -F, --file-format FILE_FORMAT    file format for exported data.
                                   Available formats are tsv.gz (tab-separated values per line) and jsonl.gz (JSON record per line).
                                   The json.gz and line-json.gz formats are default and still available but only for backward compatibility purpose;
                                     use is discouraged because they have far lower performance.
  -O, --pool-name NAME             specify resource pool by name
  -e, --encryption ENCRYPT_METHOD  export with server side encryption with the ENCRYPT_METHOD
  -a ASSUME_ROLE_ARN,              export with assume role with ASSUME_ROLE_ARN as role arn
      --assume-role
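
Because the -f/--from and -t/--to options take unixtime boundaries, you can restrict an export to a specific time window. The command below is an illustrative sketch; the database, table, bucket, keys, prefix, and timestamps are placeholders (the two timestamps are exactly one hour apart), and -w makes the command wait until the job completes:

  # Export only records with time >= 1446617523 and < 1446621123 (one hour).
  $ td table:export example_db table1 \
     --s3-bucket mybucket \
     --prefix table1/hour-1446617523/ \
     -k KEY_ID -s SECRET_KEY \
     -F jsonl.gz \
     -f 1446617523 -t 1446621123 \
     -w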

Server-Side Encryption Support

Server-side encryption protects data at rest. Bulk Export supports some server-side encryption methods.

The td table:export command with the --encryption ENCRYPT_METHOD option can dump all the data uploaded to TD into your encrypted storage. This option has been available in the td command since version 0.14.0.

The following command is an example of using x-amz-server-side-encryption: AES256 on S3:

  $ td table:export example_db table1 -F jsonl.gz --s3-bucket mybucket -k KEY_ID -s SECRET_KEY --encryption s3
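
To confirm that the exported objects were written with AES256 server-side encryption, one option is to inspect an object with the AWS CLI. This is a sketch, assuming the AWS CLI is configured and that you substitute the key of one of the exported files:

  # The ServerSideEncryption field in the response should read "AES256".
  $ aws s3api head-object --bucket mybucket --key <S3_FILE_PREFIX>/<EXPORTED_FILE_NAME>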

Best Practices for Time Partitioning of Bulk-Exported Data in S3

A common question we’ve received is: “Can Treasure Data make sure the data exported through this process is partitioned into hourly buckets, similar to the partitioning strategy maintained within the core Treasure Data system?”

Unfortunately, as noted above, the Bulk Export command no longer supports partitioning of exported data. This trade-off was made to optimize export speed, which the majority of users previously found too slow to meet their requirements.

If you do require partitioning, we recommend using this command to export one-hour segments at a time and automating the process with a script; a rough sketch of that approach follows below. While we know this isn’t the most convenient approach, it is currently the way to achieve time-based partitioning in your bulk-exported data. We will continue to consider improvements to this process in the future.
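
The following script is a minimal illustration of that loop: it steps over hour-aligned unixtime boundaries and runs one export per hour, writing each hour under its own S3 prefix. The boundary values, database, table, bucket, keys, and prefix layout are all placeholders to adapt to your environment.

  #!/bin/bash
  # Export example_db.table1 one hour at a time into hourly S3 prefixes.
  START=1446616800        # hour-aligned unixtime (illustrative)
  END=1446703200          # 24 hours later (illustrative)
  DB=example_db
  TABLE=table1
  BUCKET=mybucket

  FROM=$START
  while [ "$FROM" -lt "$END" ]; do
    TO=$((FROM + 3600))
    # Each hourly segment lands under its own prefix, e.g. table1/1446616800/
    td table:export "$DB" "$TABLE" \
      --s3-bucket "$BUCKET" \
      --prefix "$TABLE/$FROM/" \
      -k KEY_ID -s SECRET_KEY \
      -F jsonl.gz \
      -f "$FROM" -t "$TO" \
      -w
    FROM=$TO
  done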

