Data Tanks Overview
Treasure Data can be considered an “event data lake”: disparate event data sources (and a few slow-moving dimensions) are aggregated and processed into more compact, “cleaner” data packages for further processing, analysis, or visualization.
Given the size and scope of an event data lake, providing highly concurrent, interactive access over trillions of data points while retaining schema flexibility is not technologically feasible (at least not affordably). A design pattern called “lakeshore data marts” has emerged as a way to work around this limitation.
image credit: Martin Fowler, DataLake
We named our new product “Data Tanks” to stay within the data-as-water metaphor, which we believe provides a great mental model for how data pipelines for analytics work.
Data Tanks provide a convenient and accessible metric store to drive business intelligence and data visualization tools from Treasure Data without the burden of managing one or more separate data marts.
|This is an add-on feature. Please contact us if you're interested.|
Create a new Data Tank
For now, creating a new Data Tank is a completely manual process performed by our operations team. We are building the necessary API and UI support and we will release a testing version as soon as that is ready. In the meantime, if you need us to create an instance, let us know the following:
- IP range from which you will be connecting to the Data Tank
- Maximum storage size needed
- Approximate number of concurrent connections you foresee making to the Data Tank
Once the instance is available, you will see two new result export destinations in your “Favorites” list.
What is cstore?
Data Tanks are available in two flavors: row-oriented (vanilla) and columnar (cstore). If you plan to run only analytical workloads and will not need the following PostgreSQL features within your Data Tank:
- column deletion
- column update
then you should use the cstore option, which has the following benefits:
- it shrinks your data footprint by 3 to 4 times on average (up to 12 times for JSON data)
- it speeds up your queries by about 30% on average, and by another 30% when indexing is used
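For illustration only (your Data Tank instance is provisioned by our operations team), and assuming the columnar flavor is backed by the cstore_fdw extension, a columnar table is defined as a foreign table along these lines; the events table and its columns are hypothetical:

```sql
-- Sketch assuming cstore_fdw; "events" and its columns are hypothetical.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE events (
    time    integer,
    user_id varchar,
    payload json
)
SERVER cstore_server
OPTIONS (compression 'pglz');  -- per-block compression drives the footprint savings
```

Because cstore tables support neither UPDATE nor DELETE, treat them as append-only and load data in bulk (for example with COPY).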
Exporting data to a Data Tank
Exporting data to a Data Tank is just like using a normal TD result export.
The UI of the result export in the Console shows the following fields:
- Export to: reflects the type of the Data Tank instance (currently PostgreSQL is the only type available).
- Host: preset to the IP address of your instance.
- Username: created and preset by Treasure Data ops team
- Password: created and preset by Treasure Data ops team
- Use SSL?: determines whether SSL is used to connect to the instance. The default is checked.
- Database: the name of the destination database.
- Table: the name of the table within the database selected above.
- Mode: default is “append”. Replace, Truncate and Update are other options. See our existing documentation for more information. In the case of Update, the Unique key must be set explicitly.
- Use compression?: a leftover from a previous version of our API; it will be removed soon. The default setting is unchecked.
- Method: controls how data is written to the instance. Copy should be used for better performance, while Insert may still be needed for historical reasons.
- Schema: selects a target schema for the table. If not specified, then the default schema is used.
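To make the Mode options concrete: on the PostgreSQL side, “append” amounts to plain INSERTs, while “Update” with a unique key behaves like an upsert. A rough sketch of the Update semantics, using a hypothetical metrics table with unique key id (ON CONFLICT requires PostgreSQL 9.5+):

```sql
-- Hypothetical sketch of "Update" mode: insert the row, or overwrite the
-- existing row that shares the unique key (id).
INSERT INTO metrics (id, value)
VALUES (1, 42)
ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value;
```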
Using Luigi-TD with a Data Tank
It’s possible to use Luigi-TD or the CLI to write data to a Data Tank, since a Data Tank is treated no differently from any other result output. Data Tanks are not yet part of our API, so Luigi-TD and the CLI cannot control them directly; this will change in the future. Please see the Luigi-TD documentation for more information.
Using Foreign Data Wrapper for Treasure Data
It’s possible to query Treasure Data directly from a Data Tank, using Hive or Presto, through a Foreign Data Wrapper. This makes your Treasure Data datasets appear as foreign tables in your Data Tank’s database.
Specify your API key, database, and query engine type (hive or presto) in the OPTIONS of the CREATE FOREIGN TABLE statement. You can reference either a Treasure Data table by name or an inline query.
```sql
CREATE FOREIGN TABLE sample_datasets (
    time    integer,
    "user"  varchar,
    host    varchar,
    path    varchar,
    referer varchar,
    code    integer,
    agent   varchar,
    size    integer,
    method  varchar
)
SERVER td_server OPTIONS (
    apikey 'your_api_key',
    database 'sample_datasets',
    query_engine 'presto',
    table 'www_access'
);

SELECT code, count(1) FROM sample_datasets
WHERE time BETWEEN 1412121600 AND 1414800000 -- For time index pushdown in Treasure Data
GROUP BY code;

 code | count
------+-------
  404 |    17
  200 |  4981
  500 |     2
(3 rows)
```

```sql
CREATE FOREIGN TABLE nginx_status_summary (
    text varchar,
    cnt  integer
)
SERVER td_server OPTIONS (
    apikey 'your_api_key',
    database 'api_staging',
    query_engine 'presto',
    query 'SELECT c.text, COUNT(1) AS cnt
           FROM nginx_access n
           JOIN mitsudb.codes c ON CAST(n.status AS bigint) = c.code
           WHERE TD_TIME_RANGE(n.time, ''2015-07-05'')
           GROUP BY c.text'
);

SELECT * FROM nginx_status_summary;

     text     |   cnt
--------------+----------
 OK           | 10123456
 Forbidden    |       12
 Unauthorized |     3211
 :
```

```sql
CREATE TABLE imported_summary AS SELECT * FROM nginx_status_summary;

SELECT * FROM imported_summary;

     text     |   cnt
--------------+----------
 OK           | 10123456
 Forbidden    |       12
 Unauthorized |     3211
```
You can also specify a different API endpoint:
```sql
SERVER td_server OPTIONS (
    endpoint 'https://api.ybi.idcfcloud.net',
    apikey 'your_api_key',
    ...
```
Last modified: Jun 29 2016 08:00:35 UTC