Prerequisites

  • Basic knowledge of Arm Treasure Data.

  • Basic knowledge of the Presto query engine. Learn how to write Presto queries.

Tuning Tips

Specifying Only the Columns You Need

Data in Treasure Data is stored in a columnar format that is optimized for queries which read only specific columns. Restricting the columns you access can improve query performance significantly, so specify only the columns you need instead of using a wildcard (*).

Code Block
[GOOD]: SELECT time,user,host FROM tbl
[BAD]:  SELECT * FROM tbl

Leveraging Time-Based Partitioning

All imported data is automatically partitioned into hourly buckets, based on the ‘time’ field within each data record.

By specifying the time range in your query, you can avoid reading unnecessary data and can thus speed up your query significantly.

Specify Time as an Integer

The time partition optimization is applied only when the time column is compared with an integer literal (a Unix timestamp) or a simple addition or subtraction of integer literals. Division, floating-point values, and BETWEEN disable the optimization.

Code Block
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 + 3600
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 - 3600
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 13493930200 / 10
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020.00
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time BETWEEN 1349392000 AND 1349394000

Use TD_TIME_RANGE

You can also use the TD_TIME_RANGE UDF to take advantage of time-based partitioning.

Code Block
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01 PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', NULL, 'PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01',
                                       TD_TIME_ADD('2013-01-01', '1d', 'PDT'))

However, if you use division in TD_TIME_RANGE, the time partition optimization doesn’t work. For instance, the following conditions disable optimization.

Bad SQL

Code Block
SELECT ... WHERE TD_TIME_RANGE(time, TD_SCHEDULED_TIME() / 86400 * 86400)
SELECT ... WHERE TD_TIME_RANGE(time, 1356998401 / 86400 * 86400)
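
If you need a day-aligned boundary, a simple alternative (a sketch, assuming a UTC day boundary) is to precompute the aligned timestamp or pass a date string so that no arithmetic appears inside TD_TIME_RANGE. For example, 1356998401 / 86400 * 86400 evaluates to 1356998400, which is 2013-01-01 00:00:00 UTC:

Code Block
-- Sketch: pass an already-aligned integer literal or a date string instead of
-- computing the alignment with division inside the query.
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, 1356998400)
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', NULL, 'UTC')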

 

The TD_INTERVAL user-defined function also leverages time-based partitioning and provides an intuitive way to specify a time range. For example, to select the last 7 days:

Code Block
SELECT ... WHERE TD_INTERVAL(time, '-7d')

Considering the Cardinality within the GROUP BY Clause

You can improve the performance of GROUP BY by ordering the fields in the GROUP BY clause from highest to lowest cardinality, as in the following example.

Code Block
[GOOD]: SELECT ... GROUP BY uid, gender
[BAD]:  SELECT ... GROUP BY gender, uid

Also, where possible, use numbers instead of strings as GROUP BY keys, because numbers require less memory and are faster to compare than strings.
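
For example (a sketch with hypothetical column names, assuming the table stores both a numeric country_id and a string country_name for the same attribute), grouping on the numeric column is cheaper:

Code Block
-- Hypothetical columns: country_id (integer) and country_name (varchar)
-- describe the same attribute; grouping on the integer key is cheaper.
[GOOD]: SELECT country_id, COUNT(*) FROM access GROUP BY country_id
[BAD]:  SELECT country_name, COUNT(*) FROM access GROUP BY country_name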

 

Use LIMIT with ORDER BY

ORDER BY requires that all rows be sent to a single worker, which then sorts them, so it can consume a large amount of memory on that Presto worker. When you only need the top or bottom N records, add a LIMIT clause, which reduces the sort cost and memory pressure.

Code Block
[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]:  SELECT * FROM tbl ORDER BY time

 

Using Approximate Aggregate Functions

Presto has several approximate aggregation functions that give you significant performance improvements in exchange for a small error. For example, the approx_distinct() function returns an approximation of COUNT(DISTINCT x) with a standard error of 2.3%. The following example gives an approximate count of the previous day's unique users.

Code Block
SELECT
  approx_distinct(user_id)
FROM
  access
WHERE
  TD_TIME_RANGE(time,
    TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d', 'PDT'),
    TD_SCHEDULED_TIME())

 

Aggregating a Series of LIKE Clauses in One Single regexp_like Clause

Presto’s query optimizer is unable to improve queries that contain many LIKE clauses. As a consequence, the query execution might be slower than expected.

To improve performance, you can replace a series of LIKE clauses chained with OR with a single regexp_like clause, which is native to Presto.

For example:

Code Block
SELECT
  ...
FROM
  access
WHERE
  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'

can be optimized by replacing the four LIKE clauses with a single regexp_like clause:

Code Block
SELECT
  ...
FROM
  access
WHERE
  regexp_like(method, 'GET|POST|PUT|DELETE')

 

Specifying Large Tables First in Join Clauses

The default join algorithm of Presto is a broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to the worker nodes that hold the partitions. This works only when the right-hand side table is small enough to fit within one node (usually less than 2GB). If you receive an ‘Exceeded max memory xxGB’ error, the right-hand side table is too large. Presto does not perform automatic join reordering, so make sure your larger table comes first in the join clause, as in the sketch below.
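
For example (a sketch with hypothetical table names, where large_table is assumed to be much bigger than small_table):

Code Block
-- Hypothetical tables: large_table is the bigger of the two, so it is listed first.
[GOOD]: SELECT * FROM large_table l JOIN small_table s ON l.id = s.id
[BAD]:  SELECT * FROM small_table s JOIN large_table l ON s.id = l.id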

Turning on the Distributed Hash Join

If you still run into memory issues, try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys, so it works even if the right-hand side table is large, but it might be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:

Code Block
DROP TABLE IF EXISTS my_result;
-- set session join_distribution_type = 'PARTITIONED'
CREATE TABLE my_result AS 
SELECT ... FROM large_table l, small_table s WHERE l.id = s.id

Query processing may be slow if the join key values in your data are skewed.

Improving the Performance of Very Large Queries

You can parallelize the query result output process by using a CREATE TABLE AS SELECT statement. If you DROP the result table before running the query, performance improves significantly: the result output is about 5x faster than running SELECT *, because Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.

Without dropping the table first, Presto materializes query results as JSON text. If the result table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result, so even when the query computation is almost finished, writing out the JSON results takes a long time.

To clean up the result table beforehand:

  1. Add a DROP TABLE statement at the top of your query.

  2. Use CREATE TABLE (table) AS SELECT … 

For example, your query might look like this:

Code Block
DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS 
    SELECT * FROM my_table;

Presto Query FAQs
