Schema Management

Treasure Data uses the same convention as Relational Database Management Systems (RDBMSs) for managing data sets: Databases and Tables. In addition, Schema can be associated with tables.

Unlike traditional warehousing platforms, Treasure Data allows users to store-first, then schema-later. Schema can be changed at any time, with no cost.

Table of Contents

Prerequisites

  • Basic knowledge of Treasure Data, including the toolbelt.
  • A table with some data. You can find an example in the Getting Started guide.

Schema Changes Are Painful

Conventional warehousing platforms are schema dependent, supporting an assumptive analytics model. In this model, data elements forecasted to yield insights are defined in advance, as well as structure of the data store schema.

Performance considerations are also important in initial design and the analyst must have knowledge of the underlying structure to insure query performance. When new columns are added to the table, the schema must need to be changed.

Big Data analysis however, is largely non-assumptive. The analyst seeks hidden patterns, relationships or events in the data that were not intuitively obvious from the outset. The user must be able to query where the data takes him without the burden of performance considerations—and exploration can create requirements for new records to support the analysis trail.

In this model schema dependence adds a significant tax that can become prohibitive.

The TD Approach: Store-First, Schema-Later



Unlike traditional warehousing platforms, TD users can assign schema even after importing data to a table. This means that you can add or remove fields at any time.

This system is much more flexible, and schema changes no longer take days of work.

Default Schema (time and v)

When a table is created, it has two fields:

  • time: The time that each entry was generated, in int64 UNIX time
  • v: Each log entry, stored as map<string, string>

When we look up the value of a database entry, we must address the information using the format: v[‘field1’].

Furthermore, since the type of v is fixed as map<string, string>, all values are treated as a String. If we need to treat values as other data types, we must type cast them first. This is inefficient.

Setting Custom Schema

In general, the default schema is acceptable. But a custom schema makes queries shorter, and improves performance up to 30%. To set a schema to a table, we can use the ‘td schema:set’ command, as shown below:

$ td schema:set <database> <table> <column_name>:<type>...
Untitled-3
Schemas currently support column names consisting of lowercase alphabets, numbers, and "_" only.

Treasure Data supports the following data types:

  • int
  • long
  • double
  • float
  • string
  • array<T>

In the “Getting Started” guide, we set up a table with ‘user’ (int type) and ‘action’ (string type) columns. For this example, the schema would be added as follows.

$ td schema:set testdb www_access action:string labeles:'array<string>' user:int

You can now query this table without using v.

$ td query -w -d testdb "SELECT user, COUNT(1) AS cnt
  FROM www_access
  WHERE action='login'
  GROUP BY user ORDER BY cnt DESC"

Using v still works properly.

$ td query -w -d testdb "SELECT v['user'], COUNT(1) AS cnt
 FROM www_access
 WHERE v['action']='login' 
 GROUP BY v['user'] ORDER BY cnt DESC"

Next Step

For queries, we leverage the Hive Query Language.

For queries, we leverage the Presto Query Language.

Schema relation

Treasure Data Presto Hive
int bigint smallint
int bigint int
long bigint bigint
double double decimal(Hive 0.11.0 and later)
float double float
double double double
Convert to string or int boolean boolean
string varchar string or varchar (Hive 0.11.0 and later)
string or Convert to long date string
string or Convert to long timestamp timestamp

Referrers


Last modified: Feb 24 2017 09:27:52 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.