Treasure Data’s Presto release is based on the Presto 317 open source release. Our previous release was based on Presto 0.205. This article includes:
- New Magic Comment Features
- parse_decimal_literals_as_double
- Distributed_Sort
- time_partitioning_range
- Deprecated Features
- Migrate Your Distributed Joins
- Updating Presto
- About Side-by-Side Environments
- Running Code in Presto 317 vs. Presto 0.205
- Query Syntax Fixes
- Having too many OR predicate. Maximum 500 OR allowed
- Key not present in map: XXX
- Array subscript out of bounds
- Column alias list has 1 entry but 't' has N columns available'
- Window does not support distinct
- ARRAY comparison not supported for arrays with null elements
- 'XXX' must be an aggregate expression or appear in a GROUP BY clause
- Zero-length delimited identifier not allowed
- Unexpected parameters (char(1)) for function codepoint. Expected: codepoint(varchar(1))
- Invalid format: "XXX" is malformed / Invalid timezone: "XXX" is malformed
We have done internal testing to identify compatibility and performance regression issues.
The parse_decimal_literals_as_double session property is supported as a magic comment.
Decimal literals without an explicit type specifier (e.g. 1.2) are treated as the values of the DOUBLE type by default.
This magic comment can disable the behavior and use decimal literals as the values of the DECIMAL type. For example:
-- set session parse_decimal_literals_as_double = 'false'
SELECT typeof(1.1); Thedistributed_sort session property is supported as a magic comment.
Distributed sort allows sorting of data, which exceeds query.max-memory-per-node. Distributed sort is enabled through the distributed_sort session property, or distributed-sort configuration property set in etc/config.properties of the coordinator. Distributed sort is enabled by default.
When distributed sort is enabled, the sort operator executes in parallel on multiple nodes in the cluster. Partially sorted data from each worker node is then streamed to a single worker node for a final merge. This technique allows to utilization of memory of multiple worker nodes for sorting. The primary purpose of the distributed sort is to allow for the sorting of data sets that don’t normally fit into single node memory. Performance improvement can be expected, but it won’t scale linearly with the number of nodes, since the data needs to be merged by a single node.
-- set session distributed_sort = 'true'
SELECT * FROM large_table, small_table
WHERE small_table.id = large_table.id The time_partitioning_range session property is supported as a magic comment.
-- set session time_partitioning_range = 'value`
The value can be
* none
* no time partitioning
* number with unit
* allowed units are `h, d, mo, q, y` (hour, day, month, quarter, year)
* ex) 12h, 2d, 1mo, 1q, 1y
# Deprecated Features
## Migrate Your Distributed Joins
The **distributed_join** session property has been removed.
Use **join_distribution_type** session property instead. The **join_distribution_type** system property accepts the following values:
* PARTITIONED
* BROADCAST
Deprecated code:
```sql
-- set session distributed_join = 'true'
SELECT * FROM large_table, small_table
WHERE small_table.id = large_table.id
Updated code:
```sql
-- set session join_distribution_type = 'PARTITIONED'
SELECT *
FROM large_table, small_table
WHERE small_table.id = large_table.id For your testing, Presto 0.205 and Presto 317 are available side-by-side to allow testing of the code before the upgrade.
During the transition from Presto 317 and Presto 0.205 the two releases will be available side-by-side.
You can control which Presto release runs your code by using the following query hint:
| Version | Query Hint |
|---|---|
| Presto 317 | -- @TD engine_version: 317 |
| Presto 0.205 | -- @TD engine_version: 0.205 |
Running queries that worked in the previous version might cause error messages. For each case, we show the message, a query that causes the error, and an example of how to rewrite the query.
Original code that causes the error:
SELECT path FROM (
SELECT * FROM sample_datasets.www_access ) WHERE
path = 'foo1' OR path = 'foo2' OR path = 'foo3'
-- Query having a lot of "OR" clauses
OR path = 'foo559'; Modified code that fixes the error:
SELECT path FROM (
SELECT * FROM sample_datasets.www_access ) WHERE
path IN (
'foo1', 'foo2', 'foo3' ,
-- Query having a lot of conditions
'foo559'); Typically, using an IN clause to rewrite your queries solves the issue.
Key not present in map: XXX
Original code that causes the error:
SELECT MAP(ARRAY [1, 3], ARRAY [2, 4])[5]; Modified code that fixes the error:
SELECT element_at(MAP(ARRAY [1, 3], ARRAY [2, 4]), 5); You can use the function element_at to rewrite your queries. The function element_at is supported by both cluster versions, v0.205, and v317. The preceding query runs on both clusters.
Array subscript out of bounds
Original code that causes the error:
SELECT
numbers[99] -- key isn't exist
FROM
(VALUES (ARRAY[1,2,3]) ) AS t(numbers); Modified code that fixes the error:
SELECT
element_at(numbers, 99) -- key isn't exist
FROM
(VALUES (ARRAY[1,2,3]) ) AS t(numbers); The function element_at is supported by both cluster versions, v0.205, and v317.
The preceding query runs on both clusters.
Column alias list has 1 entry but 't' has N columns available'
Original code that causes the error:
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Amy', 'devops') AS ROW(name VARCHAR, department VARCHAR))
] AS users
),
u AS (
SELECT person
FROM
dataset,
UNNEST(dataset.users) AS t(person)
)
SELECT
person.name,
person.department
FROM u; Modified code that fixes the error:
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Amy', 'devops') AS ROW(name VARCHAR, department VARCHAR))
] AS users
),
u AS (
SELECT
name, department
FROM
dataset,
UNNEST(dataset.users) AS person(name, department)
)
SELECT
name,
department
FROM u; Window does not support distinct
Original code that causes the error:
SELECT
UPPER(DISTINCT(id)) AS user_id
FROM (VALUES ('foo'), ('bar')) AS t(id); Modified code that fixes the error:
SELECT
UPPER(id) AS user_id
FROM (
SELECT DISTINCT id FROM (VALUES ('foo'), ('bar')
) AS t (id)); Original code that causes the error:
SELECT id, array_agg(value)
FROM (
VALUES
(1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, null)
) AS t(id, value)
GROUP BY 1 ORDER BY 2; Modified code that fixes the error:
SELECT id, array_agg(value) FILTER (where value is not null)
FROM (
VALUES
(1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, null)
) AS t(id, value)
GROUP BY 1 ORDER BY 2; The function element_at is supported by both cluster versions, v0.205, and v317. The preceding query runs on both clusters.
For Presto 0.205, the array_agg function ignores NULL values.
presto> SELECT id, array_agg(value) FROM ( VALUES (1, null), (1, 0) ) AS t(id, value) GROUP BY 1;
id | _col1
----+-------
1 | [0]
(1 row)
presto> SELECT id, array_agg(value) FROM ( VALUES (1, null) ) AS t(id, value) GROUP BY 1;
id | _col1
----+-------
1 | NULL
(1 row)
In Presto 0.205, _deprecated_._legacyagg_ property, true is enabled in order to keep this legacy behavior of the array_agg function.
This property was removed in Presto 317 and there is no ability to roll back the legacy behavior.
For Presto 317, the array_agg function doesn't ignore NULL values and returns an array including NULL values.
presto> SELECT id, array_agg(value) FROM ( VALUES (1, null), (1, 0) ) AS t(id, value) GROUP BY 1; id | _col1 ----+----------- 1 | [null, 0] (1 row)
presto> SELECT id, array_agg(value) FROM ( VALUES (1, null) ) AS t(id, value) GROUP BY 1; id | _col1 ----+-------- 1 | [null] (1 row)
As a result, when Presto compares with a column that has arrays with NULL values, Presto returns the ARRAY comparison not supported for arrays with null elements message.
Original code that causes the error:
SELECT code FROM sample_datasets.www_access
HAVING code IS NOT NULL; Modified code that fixes the error:
SELECT code FROM sample_datasets.www_access
WHERE code IS NOT NULL; If you use a HAVING clause as a WHERE clause, this error appears.
Zero-length delimited identifier not allowed
Original code that causes the error:
SELECT 1 AS ""; Modified code that fixes the error:
SELECT 1 AS "A"; Original code that causes the error:
SELECT codepoint(CAST('a' AS CHAR)); Modified code that fixes the error:
SELECT codepoint(CAST('a' AS VARCHAR(1))); Error handling has been improved in v317. When an invalid argument gets passed, the current Presto returns NULL, but the new Presto returns the error.
Original code that causes the error:
SELECT td_time_range(1567890000, 'yyyy-MM-dd','JST', NULL); Modified code that fixes the error:
SELECT td_time_range(1567890000, '2019-09-08', 'JST', NULL);