Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

UDP can help with these Presto query types:

...

UDP Advanced Use Case Details

Table of Contents
Table of Contents-zone
minLevel2
maxLevel3minLevel2

Choosing Bucket Count, Partition Size in storage and Time Ranges for Partitions

Bucket counts must be powers of two. Higher bucket count means dividing data among many smaller partitions, which can be less efficient to scan. TD suggests starting with 512 for most cases. If you aren't sure of the best bucket count, it is safer to err on the low side. 

We recommend partitioning UDP tables on one-day or multiple-day time ranges, instead of the one-hour partitions most commonly used in TD. Otherwise you can wind up with higher costs and slower data access because too many small partitions have to be fetched from storage.

Aggregations on the hash key

Using a GROUP BY key as the bucketing key, major improvements in performance and reduction in cluster load on aggregation queries were seen. For example, you can see the UDP version of this query on a 1TB table:

  • used 10 Presto workers instead of 19

  • ran in 45 seconds instead of 2 minutes 31 seconds

processing >3x as many rows per second. (The total data processed in GB was greater because the UDP version of the table occupied more storage.)

 

UDP version:

Code Block
presto: udp_tpch_sf1000 > SELECT 
  COUNT(*)
FROM
  (
    SELECT 
      max_by(
        l_discount,
        time
      ),
      max_by(
        l_partkey,
        time
      )
    FROM
      lineitem
    GROUP BY
      l_orderkey
  )
;

_col0------------
1500000000(
  1 row
) Query 20171227_014452_14154_sjh9g,
FINISHED,
10 nodes Splits: 517 total,
517 done(
  100.00 %
) 0: 45 [6B ROWS,
25.5GB] [134M ROWS / s,
585MB / s]

non-UDP:

Code Block
presto: udp_tpch_sf1000 > SELECT 
  COUNT(*)
FROM
  (
    SELECT 
      max_by(
        l_discount,
        time
      ),
      max_by(
        l_partkey,
        time
      )
    FROM
      tpch_sf1000.lineitem
    GROUP BY
      l_orderkey
  )
;

_col0------------
1500000000(
  1 row
) Query 20171227_014549_14273_sjh9g,
FINISHED,
19 nodes Splits: 175 total,
175 done(
  100.00 %
) 2: 31 [6B ROWS,
18.3GB] [39.7M ROWS / s,
124MB / s]

Needle-in-a-Haystack Lookup on the Hash Key

The largest improvements – 5x, 10x or more – will be on lookup or filter operations where the partition key column are tested for equality. Only partitions in the bucket from hashing the partition keys are scanned.

For example, consider two tables:

  • table customers is bucketed on customer_id

  • table contacts is bucketed on country_code and area_code

 These queries will improve:

Code Block
SELECT... FROM customers WHERE customer_id = 10001;

Here with UDP Presto scans only one bucket (the one that 10001 hashes to) if customer_id is the only bucketing key.

Code Block
SELECT... FROM contacts WHERE country_code='1' and area_code = '650' and phone like'555-____'; 

Here with UDP Presto scans only the bucket that matches the hash of country_code 1 + area_code 650.

 These queries will not improve:

Code Block
SELECT... FROM customers WHERE customer_id >= 10001; 

Here UDP will not improve performance, because the predicate doesn't use '='.

Code Block
SELECT... FROM contacts WHERE area_code = '650' ; 

Here UDP will not improve performance, because the predicate does not include both bucketing keys.

Very Large Join Operations 

Very large join operations can sometimes run out of memory. Such joins can benefit from UDP. Distributed and colocated joins will use less memory, CPU, and shuffle less data among Presto workers. This may enable you to finish queries that would otherwise run out of resources. To leverage these benefits, you must:

  1. Make sure the two tables to be joined are partitioned on the same keys

  2. Use equi-joins across all the partitioning keys

  3. Set the following options on your join using a magic comment:

Code Block
-- set session distributed_join = 'true' 
-- set session colocated_join = 'true'

Improving Performance on Skewed Data

When processing a UDP query, Presto ordinarily creates one split of filtering work per bucket (typically 512 splits, for 512 buckets). But if data is not evenly distributed, filtering on skewed bucket could make performance worse -- one Presto worker node will handle the filtering of that skewed set of partitions, and the whole query lags.

To enable higher scan parallelism you can use:

Code Block
-- set session distributed_bucket='true|false'

When set to true, multiple splits are used to scan the files in a bucket in parallel, increasing performance. The tradeoff is that colocated join is always disabled when distributed_bucket is true. As a result, some operations such as GROUP BY will require shuffling and more memory during execution.

This query hint is most effective with needle-in-a-haystack queries. Even if these queries perform well with the query hint, Customers should test performance with and without the query hint in other use cases on those tables to find the best performance tradeoffs. 

...