Presto Performance Tuning

Table of Contents


Leveraging Time-based partitioning

All imported data is automatically partitioned into hourly buckets, based on the ‘time’ field within each data record. By specifying the time range in your query, you can avoid reading unnecessary data and can thus speed up your query significantly.

1.) WHERE time <=> Integer

When the ‘time’ field within the WHERE clause is specified, the query parser will automatically detect which partition(s) should be processed. Please note that this auto detection will not work if you specify the time with float instead of int.

[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 + 3600
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 - 3600
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 13493930200 / 10
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020.00
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time BETWEEN 1349392000 AND 1349394000


An easier way to slice data is to use TD_TIME_RANGE UDF.

[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01 PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', NULL, 'PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01',
                                       TD_TIME_ADD('2013-01-01', '1day', 'PDT'))

However, if you use the TD_TIME_FORMAT UDF or division in TD_TIME_RANGE, time partition opimization doesn’t work. For instance, the following conditions disable optimization.

[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_TIME_FORMAT(1356998401, 'yyyy-MM-dd'))
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_SCHEDULED_TIME() / 86400 * 86400))
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, 1356998401 / 86400 * 86400))

Considering the cardinality within GROUP BY

There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.

[GOOD]: SELECT GROUP BY uid, gender
[BAD]:  SELECT GROUP BY gender, uid

Using approximate aggregate functions

Presto has a couple of approximate aggregation functions, which will give you significant performance improvements with some errors. For example, by using approx_distinct() function, you can get an approximation of COUNT(DISTINCT x) with standard error of 2.3%. The example below gives an approximate count of the previous day’s unique users.


Aggregating a series of LIKE clauses in one single regexp_like clause

Presto’s query optimizer is unable to improve queries where many LIKE clauses are used. As a consequence the query execution can be slower than expected in this case.

To improve the performance, one can substitute a series of LIKE clauses chainied in an OR with a single regexp_like clause, which is Presto native.

For example:

  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'

can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:

  regexp_like(method, 'GET|POST|PUT|DELETE')

Specifying large tables first in join clause

The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the paritions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.

Turning on the distributed hash join

If you stil see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:

-- set session distributed_join = 'true'
SELECT ... FROM large_table l, small_table s WHERE =

Note that however if there is a skew of the key values in your data, some partition needs to process larger data set compared to the others. This might slow down the query processing.

Last modified: Nov 13 2015 02:49:40 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.