Treasure Data uses two query engines that use SQL commands as well as a pre-defined query interface that allows the user to filter data using company-defined attribute and behavior filters. All of this can be found in the TD Console. The two query engines are Presto and Hive and the query editor for those are found in the Data Workbench under Queries. The pre-defined query interface is found in the Audience Studio under Master Segments.
Queries are a request for information from a database. In Treasure Data, queries are used to give you regular insights in to what your customer behavior is from the variety of first, second and third party data sources that your organization has access to. You can create regularly scheduled queries that will update specific information at regular intervals or you can create queries that provide moment-in-time snapshots of your customers.
Using Presto or Hive instead of the Query UI in Audience Studio, allows for more customized querying.
Presto is designed for short interactive queries useful for data exploration. It uses the industry-standard ANSI SQL dialect.
It offers better real time results than Hive, but has high latency in processing large amounts of data. Additionally, it has some limits in memory capacity, so if the data you are querying exceeds that memory capacity, the query will fail. At the same time, it processes queries 10-30X faster than Hive.
So when would one use Presto?
You have a small to medium size query and you want immediate results for a specific ad hoc or exploratory query.
For information on how to write queries in Presto refer to this About Presto Query Engine.
Hive is good for batch processing and regularly scheduled queries. It has a low query latency when it comes to larger amounts of data, but has a slower processing time so it’s not as good for smaller ad hoc and exploratory queries. Hive is good for queries where it’s important that it has a 99.9% chance of running all the way to completion because it has a higher fault tolerance and will not need the entire query to be rerun if an error results. Hive uses a dialect of SQL language called HiveQL. HiveQL has a number of good features especially for the use cases of querying a large number of tables, but doesn’t have as complete and familiar a feature set as Presto.
So when would one use Hive?
You have a large size query for which you don’t need immediate results, as well as you have regularly scheduled queries.
For information on how to write queries in Hive refer to this About Hive Query Engine.