Treasure Data uses two query engines that use SQL commands as well as a pre-defined query interface that allows the user to filter data using company-defined attribute and behavior filters. All of this can be found in the TD Console. The two query engines are Presto and Hive and the query editor for those are found in the Data Workbench under Queries. The pre-defined query interface is found in the Audience Studio under Master Segments.
In Treasure Data, queries are used to give you regular insights in to what your customer behavior is from the variety of first, second and third party data sources that your organization has access to. You can create regularly scheduled queries that will update specific information at regular intervals or you can create queries that provide moment-in-time snapshots of your customers.
You can execute queries from the following interfaces:
- TD Console
- API
- JDBC/ODBC
- Scheduled queries
- Hosted workflow execution framework
Every query is managed as a job, and for each job you can specify a query engine.
Prior to its rebranding in 2021, the Trino query engine was known as PrestoSQL or more simply Presto. Treasure Data supported several earlier versions of Presto. Newer versions of Presto are now supported under the Trino branding, and support for Trino 423 has now been introduced. For the immediate future, you will see both Presto and Trino used in Treasure Data documentation. New functionality or new features will be identified as being Trino; however, documentation for existing functions and features that are common to both Presto and Trino will continue to be identified as Presto.
Presto is designed for short interactive queries useful for data exploration. It uses the industry-standard ANSI SQL dialect.
It offers better real time results than Hive, but has high latency in processing large amounts of data. Additionally, it has some limits in memory capacity, so if the data you are querying exceeds that memory capacity, the query will fail. At the same time, it processes queries 10-30X faster than Hive.
So when would one use Presto?
You have a small to medium size query and you want immediate results for a specific ad hoc or exploratory query.
You can find more details about Presto in About Presto Query Engine.
Hive is good for batch processing and regularly scheduled queries. It has a low query latency when it comes to larger amounts of data, but has a slower processing time so it’s not as good for smaller ad hoc and exploratory queries. Hive is good for queries where it’s important that it has a 99.9% chance of running all the way to completion because it has a higher fault tolerance and will not need the entire query to be rerun if an error results. Hive uses a dialect of SQL language called HiveQL. HiveQL has a number of good features especially for the use cases of querying a large number of tables, but doesn’t have as complete and familiar a feature set as Presto.
So when would one use Hive?
You have a large size query for which you don’t need immediate results, as well as you have regularly scheduled queries.
You can find more details about Hive in About Hive Query Engine.
Trino (Presto) — Trino is designed for fast, interactive queries. Trino (Presto) is optimized for latency. Its execution model is refered to as push-model.
Hive — Designed for querying of large data systems in the open-source Hadoop platform. Hive 2023.1 converts SQL-like queries into Tez jobs for easy execution and processing of extremely large volumes of data. Hive is optimized for query throughput. Its execution model is refered to as pull-model.
| Trino (Presto) | Hive | |
|---|---|---|
| Optimized for | Interactivity | Throughput |
| SQL Standardized fidelity | Designed to comply with ANSI SQL | HiveQL (a subset of common data warehousing SQL) |
| Large JOINs | Optimized for star scheme joins (1 large fact table and many smaller dimension tables) | Very good for large fact-to-fact joins |
| Maximum Query Length | 1MB | No Hard Limit |