Presto 0.205 to 317 Migration 2020

Treasure Data’s Presto release is based on the Presto 317 open source release. Our previous release was based on Presto 0.205.  This article includes:


We have done internal testing to identify compatibility and performance regression issues. 

New Magic Comment Features

parse_decimal_literals_as_double

The parse_decimal_literals_as_double session property is supported as a magic comment.

Decimal literals without an explicit type specifier (e.g. 1.2) are treated as the values of the DOUBLE type by default.

This magic comment can disable the behavior and use decimal literals as the values of the DECIMAL type. For example:

-- set session parse_decimal_literals_as_double = 'false'
SELECT typeof(1.1);

distributed_sort

The distributed_sort session property is supported as a magic comment.

Distributed sort allows sorting of data, which exceeds query.max-memory-per-node. Distributed sort is enabled through the distributed_sort session property, or distributed-sort configuration property set in etc/config.properties of the coordinator. Distributed sort is enabled by default.

When distributed sort is enabled, the sort operator executes in parallel on multiple nodes in the cluster. Partially sorted data from each worker node is then streamed to a single worker node for a final merge. This technique allows to utilization of memory of multiple worker nodes for sorting. The primary purpose of the distributed sort is to allow for the sorting of data sets that don’t normally fit into single node memory. Performance improvement can be expected, but it won’t scale linearly with the number of nodes, since the data needs to be merged by a single node.



-- set session distributed_sort = 'true'
SELECT * FROM large_table, small_table
WHERE small_table.id = large_table.id

time_partitioning_range

The time_partitioning_range session property is supported as a magic comment.

-- set session time_partitioning_range = 'value`

The value can be

  • none

    • no time partitioning

  • number with unit

    • allowed units are h, d, mo, q, y (hour, day, month, quarter, year)

    • ex) 12h, 2d, 1mo, 1q, 1y

Deprecated Features

Migrate Your Distributed Joins

The distributed_join session property has been removed.

Use join_distribution_type session property instead. The join_distribution_type system property accepts the following values:

  • PARTITIONED

  • BROADCAST

Deprecated code:

-- set session distributed_join = 'true'
SELECT * FROM large_table, small_table
WHERE small_table.id = large_table.id

Updated code:

-- set session join_distribution_type = 'PARTITIONED'
SELECT * 
FROM large_table, small_table
WHERE small_table.id = large_table.id


Updating Presto

About Side-by-Side Environments

For your testing, Presto 0.205 and Presto 317 are available side-by-side to allow testing of the code before the upgrade.

Running Code in Presto 317 vs. Presto 0.205

During the transition from Presto 317 and Presto 0.205 the two releases will be available side-by-side.

You can control which Presto release runs your code by using the following query hint:

Version

Query Hint

Presto 317

  • - @TD engine_version: 317

Presto 0.205

  • - @TD engine_version: 0.205

Query Syntax Fixes

Running queries that worked in the previous version might cause error messages. For each case, we show the message, a query that causes the error, and an example of how to rewrite the query.

The messages are:

Having too many OR predicate. Maximum 500 OR allowed

Original code that causes the error:

SELECT path FROM  ( 
 SELECT * FROM sample_datasets.www_access ) WHERE 
path = 'foo1' OR path = 'foo2' OR path = 'foo3'
-- Query having a lot of "OR" clauses
OR path = 'foo559';

Modified code that fixes the error:

SELECT path FROM ( 
 SELECT * FROM sample_datasets.www_access ) WHERE
path IN (
 'foo1', 'foo2', 'foo3' ,
-- Query having a lot of conditions
'foo559');

Typically, using an IN clause to rewrite your queries solves the issue.


Key not present in map: XXX

Original code that causes the error:

SELECT MAP(ARRAY [1, 3], ARRAY [2, 4])[5];

Modified code that fixes the error:

SELECT element_at(MAP(ARRAY [1, 3], ARRAY [2, 4]), 5);

You can use the function element_at to rewrite your queries. The function element_at is supported by both cluster versions, v0.205, and v317. The preceding query runs on both clusters.


Array subscript out of bounds

Original code that causes the error:

SELECT
 numbers[99] -- key isn't exist
FROM
 (VALUES (ARRAY[1,2,3]) ) AS t(numbers);

Modified code that fixes the error:

SELECT
  element_at(numbers, 99) -- key isn't exist
FROM
  (VALUES (ARRAY[1,2,3]) ) AS t(numbers);

The function element_at is supported by both cluster versions, v0.205 and v317.

The preceding query run on both clusters.


Column alias list has 1 entries but \'t\' has N columns available'

Original code that causes the error:

WITH
dataset AS (
 SELECT ARRAY[
 CAST(ROW('Amy', 'devops') AS ROW(name VARCHAR, department VARCHAR))
 ] AS users
),
u AS (
 SELECT person
 FROM
 dataset,
 UNNEST(dataset.users) AS t(person)
 )
SELECT
 person.name,
 person.department
FROM u;

Modified code that fixes the error:

WITH
dataset AS (
 SELECT ARRAY[
 CAST(ROW('Amy', 'devops') AS ROW(name VARCHAR, department VARCHAR))
 ] AS users
),
u AS (
 SELECT
 name, department
 FROM
 dataset,
 UNNEST(dataset.users) AS person(name, department)
 )
SELECT
 name,
 department
FROM u;


window does not support distinct

Original code that causes the error:

SELECT 
  UPPER(DISTINCT(id)) AS user_id 
FROM (VALUES ('foo'), ('bar')) AS t(id);

Modified code that fixes the error:

 SELECT 
  UPPER(id) AS user_id 
FROM (
  SELECT DISTINCT id FROM (VALUES ('foo'), ('bar')
) AS t (id));

ARRAY comparison not supported for arrays with null elements

Original code that causes the error:

SELECT id, array_agg(value)
FROM (
 VALUES
 (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, null)
) AS t(id, value)
GROUP BY 1 ORDER BY 2;

Modified code that fixes the error:

 SELECT id, array_agg(value) FILTER (where value is not null)
FROM (
 VALUES
 (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, null)
) AS t(id, value)
GROUP BY 1 ORDER BY 2;

The function element_at is supported by both cluster versions, v0.205 and v317. The preceding query runs on both clusters.

Details of this error: Behavior change of array_agg function

For Presto 0.205, the array_agg function ignores NULL values.

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null), (1, 0) ) AS t(id, value) GROUP BY 1;
 id | _col1
----+-------
 1 | [0]
(1 row)

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null) ) AS t(id, value) GROUP BY 1;
 id | _col1
----+-------
 1 | NULL
(1 row)

In Presto 0.205, deprecated.legacyagg property, true is enabled in order to keep this legacy behavior of the array_agg function.

This property was removed in Presto 317 and there is no ability to roll back the legacy behavior.

For Presto 317, the array_agg function doesn't ignore NULL values and returns an array including NULL values.

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null), (1, 0) ) AS t(id, value) GROUP BY 1;
 id | _col1
----+-----------
 1 | [null, 0]
(1 row)

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null) ) AS t(id, value) GROUP BY 1;
 id | _col1
----+--------
 1 | [null]
(1 row)

As a result, when Presto compares with a column that has arrays with NULL values, Presto returns the ARRAY comparison not supported for arrays with null elements message.

'XXX' must be an aggregate expression or appear in a GROUP BY clause

Original code that causes the error:

SELECT code FROM sample_datasets.www_access 
HAVING code IS NOT NULL;

Modified code that fixes the error:

SELECT code FROM sample_datasets.www_access 
WHERE code IS NOT NULL;

If you use a HAVING clause as a WHERE clause, this error appears.


Zero-length delimited identifier not allowed

Original code that causes the error:

SELECT 1 AS "";

Modified code that fixes the error:

SELECT 1 AS "A";

Unexpected parameters (char(1)) for function codepoint. Expected: codepoint(varchar(1))

Original code that causes the error:

SELECT codepoint(CAST('a' AS CHAR));

Modified code that fixes the error:

SELECT codepoint(CAST('a' AS VARCHAR(1)));

Invalid format: "XXX" is malformed / Invalid timezone: "XXX" is malformed

Error handling has been improved in v317. When an invalid argument gets passed, the current Presto returns NULL, but the new Presto returns the error.

Original code that causes the error:

SELECT td_time_range(1567890000, 'yyyy-MM-dd','JST', NULL);

Modified code that fixes the error:

SELECT td_time_range(1567890000, '2019-09-08', 'JST', NULL);