Presto Performance Tuning

Table of Contents

Prerequisites

Only specifying the columns you need

Actual data in Treasure Data is stored as columnar storage format, which is optimized for the query using only specific columns. So restricting accessed columns can improve your query performance significantly. Please specify the needed columns instead of using wildcard (*).

[GOOD]: SELECT time,user,host FROM tbl
[BAD]:  SELECT * FROM tbl

Leveraging Time-based partitioning

All imported data is automatically partitioned into hourly buckets, based on the ‘time’ field within each data record. By specifying the time range in your query, you can avoid reading unnecessary data and can thus speed up your query significantly.

1.) WHERE time <=> Integer

When the ‘time’ field within the WHERE clause is specified, the query parser will automatically detect which partition(s) should be processed. Please note that this auto detection will not work if you specify the time with float instead of int.

[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 + 3600
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 - 3600
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 13493930200 / 10
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020.00
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time BETWEEN 1349392000 AND 1349394000

2.) TD_TIME_RANGE

An easier way to slice data is to use TD_TIME_RANGE UDF.

[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01 PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', NULL, 'PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01',
                                       TD_TIME_ADD('2013-01-01', '1day', 'PDT'))

However, if you use the TD_TIME_FORMAT UDF or division in TD_TIME_RANGE, time partition opimization doesn’t work. For instance, the following conditions disable optimization.

[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_TIME_FORMAT(TD_SCHEDULED_TIME(), 'yyyy-MM-dd'))
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_TIME_FORMAT(1356998401, 'yyyy-MM-dd'))
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_SCHEDULED_TIME() / 86400 * 86400))
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, 1356998401 / 86400 * 86400))

Considering the cardinality within GROUP BY

There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.

[GOOD]: SELECT GROUP BY uid, gender
[BAD]:  SELECT GROUP BY gender, uid

Another tips about GROUP BY is to use number instead of strings as GROUP BY column because numbers require less memory are faster to compare than strings.

Use LIMIT with ORDER BY

ORDER BY requires all rows are sent to a single worker and then sort them. It may often require a lot of memory on a Presto worker. When you want to look up the top or bottom N records, please use LIMIT which can reduce the sort cost and memory pressure.

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]:  SELECT * FROM tbl ORDER BY time

Using approximate aggregate functions

Presto has a couple of approximate aggregation functions, which will give you significant performance improvements with some errors. For example, by using approx_distinct() function, you can get an approximation of COUNT(DISTINCT x) with standard error of 2.3%. The example below gives an approximate count of the previous day’s unique users.

SELECT
  approx_distinct(user_id)
FROM
  access
WHERE
  TD_TIME_RANGE(time,
    TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d', 'PDT'),
    TD_SCHEDULED_TIME())

Aggregating a series of LIKE clauses in one single regexp_like clause

Presto’s query optimizer is unable to improve queries where many LIKE clauses are used. As a consequence the query execution can be slower than expected in this case.

To improve the performance, one can substitute a series of LIKE clauses chainied in an OR with a single regexp_like clause, which is Presto native.

For example:

SELECT
  ...
FROM
  access
WHERE
  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'

can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:

SELECT
  ...
FROM
  access
WHERE
  regexp_like(method, 'GET|POST|PUT|DELETE')

Specifying large tables first in join clause

The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the paritions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.

Turning on the distributed hash join

If you stil see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:

-- set session distributed_join = 'true'
SELECT ... FROM large_table l, small_table s WHERE l.id = s.id

Note that however if there is a skew of the key values in your data, some partition needs to process larger data set compared to the others. This might slow down the query processing.

Use RANK function instead of ROW_NUMBER to get TOP N records

In some of cases, you would like to pull only latest/top N records. But, row_number() is slower than equivalent rank()

-- BAD QUERY
SELECT checksum(rnk)
FROM (
  SELECT row_number() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk
  FROM lineitem
) t
WHERE rnk = 1

Using RANK function would be better performance than ROW_NUMBER FUNCTION.

-- GOOD QUERY
SELECT checksum(rnk)
FROM (
  SELECT rank() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk
  FROM lineitem
) t
WHERE rnk = 1

Last modified: Mar 27 2017 01:44:33 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.