Data Tanks Overview

Treasure Data can be considered an “event data lake” where disparate event data sources (and a few slow moving dimensions) are aggregated and processed to create more compact and “cleaner” data packages for further processing, analysis or visualization.

Given the size and scope of an event data lake, providing highly concurrent, interactive access over trillions of data points while retaining schema flexibility is technologically impossible (at least affordably). To work around this limitation, a design pattern called “lakeshore data marts” has emerged.

Image credit: Martin Fowler, DataLake

We named our new product “Data Tanks” to stay within the data-as-water metaphor, which we believe provides a great mental model for how data pipelines for analytics work.

Data Tanks provide a convenient and accessible metric store to drive business intelligence and data visualization tools from Treasure Data without the burden of managing one or more separate data marts.

Create a new Data Tank

For now, creating a new Data Tank is a completely manual process performed by our operations team. We are building the necessary API and UI support and we will release a testing version as soon as that is ready. In the meantime, if you need us to create an instance, let us know the following:

  1. IP range from which you will be connecting to the Data Tank
  2. Maximum storage size needed
  3. Approximate number of concurrent connections you foresee making to the Data Tank

After agreeing on the instance size, the operations team provisions a new Data Tank.

To confirm that your Data Tank has been successfully provisioned, open a new query editor screen, check the “Output Results” box, and search for “datatank” among the saved connections. Both “datatank” and “datatank_cstore” should show up.


What is cstore?

Data Tanks are available in two flavors: row-oriented (vanilla) and columnar (cstore). If you are not planning to use the following PostgreSQL features within your Data Tank:

  • column deletion
  • column update

and you’re only planning to run analytical workloads, then you should use the cstore option, which offers the following benefits:

  • shrink your data footprint by 3 to 4 times on average (up to 12 times for JSON data)
  • speed up your queries by an average of 30%, and another 30% more by using indexing

Exporting data to a Data Tank

Exporting data to a Data Tank is just like using a normal TD result export.

The result export UI in the Console shows the following fields:

  1. Database: the name of the destination database. Do not change this.
  2. Table: the name of the table within the database selected above. If the destination table does not exist, it is created on the fly.
  3. Mode: the default is “append”. Replace, Truncate, and Update are other options; see our existing documentation for more information. In the case of Update, the Unique key must be set explicitly (a sketch of the Update semantics appears after this list).
  4. Method: this option controls the way data is written to the instance. Copy should be used for better performance, while Insert might be necessary for historical reasons.
  5. Schema: selects a target schema for the table. If not specified, then the default schema is used.
  6. Foreign Data Wrapper: whether or not the destination table uses the cstore extension.
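
For intuition, the Update mode with a Unique key behaves roughly like a PostgreSQL upsert on the Data Tank side. The sketch below is illustrative only: the table and column names are assumptions, and the actual export logic is managed by Treasure Data.

-- Illustrative sketch only: Update mode with a Unique key behaves roughly like
-- an upsert. The table and column names are assumptions, not the actual export
-- implementation.
CREATE TABLE IF NOT EXISTS public.daily_summary (
    day        date PRIMARY KEY,   -- the column configured as the Unique key
    pageviews  bigint
);

INSERT INTO public.daily_summary (day, pageviews)
VALUES ('2018-04-10', 1234)
ON CONFLICT (day) DO UPDATE
    SET pageviews = EXCLUDED.pageviews;
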
We provide two types of users:

  • tank_integration_user: used when Treasure Data components access a Data Tank, for example result export to Data Tanks or the pg> operator in Treasure Workflow. Please contact support if you need the tank_integration_user password.
  • tank_user: used for client-side access. With this user, you can connect to a Data Tank like a regular PostgreSQL database.
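
For example, once connected as tank_user with any PostgreSQL client (psql, a BI tool, and so on), you can run ordinary SQL against the tables you exported. The schema, table, and column names below are illustrative assumptions.

-- Ordinary PostgreSQL queries work against a Data Tank once connected as
-- tank_user; the schema, table, and column names are illustrative assumptions.
SELECT day, pageviews
FROM public.daily_summary
ORDER BY day DESC
LIMIT 7;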

Querying Data Tanks from Treasure Data’s console

You can query Data Tanks using Presto, including the ability to join data across PlazmaDB (primary storage) and Data Tanks.

Some use cases include:

  1. Store user master data (which requires row-by-row lookups) in a Data Tank and join it against raw event data in PlazmaDB (see the sketch after this list)
  2. Cache the result of an expensive analytic query in a Data Tank table and re-use it in Presto joins
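
The sketch below illustrates the first use case. It assumes the Data Tank is exposed to Presto as a catalog named “datatank”; the database, schema, table, and column names are likewise illustrative assumptions.

-- Hedged sketch of a Presto query joining raw events in PlazmaDB with a user
-- master table stored in a Data Tank. The catalog name ("datatank") and all
-- database/table/column names are illustrative assumptions.
SELECT u.segment,
       COUNT(1) AS events
FROM my_database.events e                  -- raw event data in PlazmaDB
JOIN datatank.public.user_master u         -- dimension table in the Data Tank
    ON e.user_id = u.user_id
WHERE TD_TIME_RANGE(e.time, '2018-04-01', '2018-04-02')
GROUP BY u.segment;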

This also means that Data Tanks can be queried as a task within Treasure Workflow.

Querying Treasure Data’s PlazmaDB (primary storage) from Data Tanks

It’s possible to query Treasure Data’s PlazmaDB (primary storage) directly from a Data Tank. This makes your Treasure Data datasets appear as foreign tables in your Data Tank’s database.

Specify your API key, database, and query engine type (presto or hive) in the CREATE FOREIGN TABLE statement. You can specify either a table name or a query to run against Treasure Data directly.

CREATE FOREIGN TABLE sample_datasets (
    time integer,
    "user" varchar,
    host varchar,
    path varchar,
    referer varchar,
    code integer,
    agent varchar,
    size integer,
    method varchar
)
SERVER td_server OPTIONS (
    apikey 'your_api_key',
    database 'sample_datasets',
    query_engine 'presto',
    table 'www_access'
);

SELECT code, count(1)
FROM sample_datasets
WHERE time BETWEEN 1412121600 AND 1414800000    -- For time index pushdown in Treasure Data
GROUP BY code;

 code | count
------+-------
  404 |    17
  200 |  4981
  500 |     2
(3 rows)

CREATE FOREIGN TABLE nginx_status_summary (
    text varchar,
    cnt integer
)
SERVER td_server OPTIONS (
    apikey 'your_api_key',
    database 'api_staging',
    query_engine 'presto',
    query 'SELECT c.text, COUNT(1) AS cnt FROM nginx_access n
          JOIN mitsudb.codes c ON CAST(n.status AS bigint) = c.code
          WHERE TD_TIME_RANGE(n.time, ''2015-07-05'')
          GROUP BY c.text'
);

SELECT * FROM nginx_status_summary;
     text      |   cnt
---------------+----------
 OK            | 10123456
 Forbidden     |       12
 Unauthorized  |     3211
    :

CREATE TABLE imported_summary AS SELECT * FROM nginx_status_summary;
SELECT * FROM imported_summary;
     text      |   cnt
---------------+----------
 OK            | 10123456
 Forbidden     |       12
 Unauthorized  |     3211

You can also specify a different API endpoint:

SERVER td_server OPTIONS (
    endpoint 'https://api.ybi.idcfcloud.net',
    apikey 'your_api_key',
    ...
