...
Choosing Bucket Count, Partition Size in Storage, and Time Ranges for Partitions

Bucket counts must be powers of two. A higher bucket count divides the data among many smaller partitions, which can be less efficient to scan. TD suggests starting with 512 for most cases. If you aren't sure of the best bucket count, it is safer to err on the low side.

We recommend partitioning UDP tables on one-day or multiple-day time ranges, instead of the one-hour partitions most commonly used in TD. Otherwise, you might incur higher costs and slower data access because too many small partitions have to be fetched from storage.

Aggregations on the Hash Key

Using a GROUP BY key as the bucketing key produced major improvements in performance and a reduction in cluster load on aggregation queries. For example, on a 1TB table, the UDP version of one such query processed more than 3x as many rows per second as the non-UDP version. The total data processed in GB was greater because the UDP version of the table occupied more storage.

UDP version:
non-UDP:
Needle-in-a-Haystack Lookup on the Hash Key

The largest improvements (5x, 10x, or more) will be on lookup or filter operations where the partition key columns are tested for equality. Only partitions in the bucket that the partition key values hash to are scanned. For example, consider:
These queries will improve:
With an equality filter such as customer_id = 10001, UDP Presto scans only one bucket (the one that 10001 hashes to), provided customer_id is the only bucketing key.
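A toy model of this lookup, written in Python rather than Presto SQL, may make the mechanism concrete. The `bucket_for` helper and the use of CRC32 are illustrative assumptions: Presto uses its own hash function, so real bucket numbers will differ. What carries over is that an equality predicate fixes the key value, and therefore fixes the single bucket that has to be scanned.

```python
import zlib

BUCKET_COUNT = 512  # starting value suggested in the text; must be a power of two


def bucket_for(key_values, bucket_count=BUCKET_COUNT):
    """Map a tuple of bucketing-key values to a bucket number.

    Hypothetical stand-in for Presto's internal hashing: CRC32 over the
    joined string form of the values, reduced modulo the bucket count.
    """
    if bucket_count < 1 or bucket_count & (bucket_count - 1):
        raise ValueError("bucket count must be a power of two")
    encoded = "|".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(encoded) % bucket_count


# Equality on the only bucketing key: the value is known up front,
# so the one bucket that can contain matching rows is known too.
single_key_bucket = bucket_for((10001,))

print(f"customer_id = 10001 -> bucket {single_key_bucket} of {BUCKET_COUNT}")
```

When a table has more than one bucketing key, all key values are hashed together, which is why the model takes a tuple: the bucket depends on the combination, not on any single column.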
With equality filters on both bucketing keys, country_code = 1 and area_code = 650, UDP Presto scans only the bucket that matches the hash of country_code 1 + area_code 650.

These queries will not improve:
UDP will not improve performance when the predicate on the bucketing key doesn't use '=', as with a range comparison.
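One way to see why only '=' helps is to look at where neighbouring key values land. The Python sketch below is a toy model, not Presto code: the `bucket_of` and `buckets_touched` helpers, the CRC32 hash, and the 100-value range (standing in for a hypothetical predicate like customer_id BETWEEN 10001 AND 10100) are all assumptions made for illustration. Because hashing scatters adjacent values, rows matching a range are spread over many buckets, so there is no single bucket to narrow the scan to.

```python
import zlib

BUCKET_COUNT = 512


def bucket_of(value, bucket_count=BUCKET_COUNT):
    """Hypothetical stand-in for Presto's bucketing hash (CRC32 mod bucket count)."""
    return zlib.crc32(str(value).encode("utf-8")) % bucket_count


def buckets_touched(values, bucket_count=BUCKET_COUNT):
    """Return the set of buckets that hold at least one of the given key values."""
    return {bucket_of(v, bucket_count) for v in values}


# Equality: exactly one candidate value, hence exactly one bucket.
equality = buckets_touched([10001])

# Range over 100 consecutive ids: neighbouring values land in unrelated
# buckets, so the matching rows are spread across the table.
value_range = buckets_touched(range(10001, 10101))

print(f"equality touches {len(equality)} bucket")
print(f"100-value range touches {len(value_range)} buckets")
```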
UDP also will not improve performance when the predicate does not include both bucketing keys.

Very Large Join Operations

Very large join operations can sometimes run out of memory. Such joins can benefit from UDP. Distributed and colocated joins use less memory and CPU, and shuffle less data among Presto workers. This may enable you to finish queries that would otherwise run out of resources. To leverage these benefits, you must:
Improving Performance on Skewed Data

When processing a UDP query, Presto ordinarily creates one split of filtering work per bucket (typically 512 splits, for 512 buckets). But if data is not evenly distributed, filtering on a skewed bucket can make performance worse: one Presto worker node handles the filtering of that skewed set of partitions, and the whole query lags. To enable higher scan parallelism you can use:
When distributed_bucket is set to true, multiple splits are used to scan the files in a bucket in parallel, increasing performance. The tradeoff is that colocated join is always disabled when distributed_bucket is true. As a result, some operations such as GROUP BY will require shuffling and more memory during execution. This query hint is most effective with needle-in-a-haystack queries. Even if these queries perform well with the query hint, customers should test performance with and without the query hint in other use cases on those tables to find the best performance tradeoffs.
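The effect of the hint on a skewed table can be illustrated with a small Python model. The file sizes, the bucket layout, and the `longest_unit` helper are invented for illustration. The model assumes each split is scanned by one worker and that enough workers are available, and it approximates "multiple splits per bucket" as one split per file. It sketches the reasoning above and is not a description of Presto's scheduler.

```python
def longest_unit(buckets, split_per_file):
    """Size of the largest unit of scan work, which bounds how long the scan takes.

    buckets: dict mapping bucket number -> list of file sizes (arbitrary units).
    split_per_file=False models one split per bucket (the default behaviour).
    split_per_file=True models distributed_bucket = true, approximated here as
    one split per file so that a bucket's files can be scanned in parallel.
    """
    if split_per_file:
        units = [size for files in buckets.values() for size in files]
    else:
        units = [sum(files) for files in buckets.values()]
    return max(units)


# Hypothetical skewed layout: bucket 0 holds eight files, the others one each.
skewed = {0: [10] * 8, 1: [10], 2: [10], 3: [10]}

per_bucket = longest_unit(skewed, split_per_file=False)
per_file = longest_unit(skewed, split_per_file=True)

print(f"one split per bucket: largest unit = {per_bucket}")  # 80
print(f"splits within bucket: largest unit = {per_file}")    # 10
```

The model covers only the scan. It leaves out the cost named above: with the hint enabled, colocated join is disabled, so operations such as GROUP BY may need to shuffle more data and use more memory.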
...