Skip to content
Last updated

Presto 0 205 to 317 Migration 2020

Treasure Data’s Presto release is based on the Presto 317 open source release. Our previous release was based on Presto 0.205. This article includes:

We have done internal testing to identify compatibility and performance regression issues.

New Magic Comment Features

parse_decimal_literals_as_double

The parse_decimal_literals_as_double session property is supported as a magic comment.

Decimal literals without an explicit type specifier (e.g. 1.2) are treated as the values of the DOUBLE type by default.

This magic comment can disable the behavior and use decimal literals as the values of the DECIMAL type. For example:

-- set session parse_decimal_literals_as_double = 'false'
    SELECT typeof(1.1);  

Distributed_Sort

Thedistributed_sort session property is supported as a magic comment.

Distributed sort allows sorting of data, which exceeds query.max-memory-per-node. Distributed sort is enabled through the distributed_sort session property, or distributed-sort configuration property set in etc/config.properties of the coordinator. Distributed sort is enabled by default.

When distributed sort is enabled, the sort operator executes in parallel on multiple nodes in the cluster. Partially sorted data from each worker node is then streamed to a single worker node for a final merge. This technique allows to utilization of memory of multiple worker nodes for sorting. The primary purpose of the distributed sort is to allow for the sorting of data sets that don’t normally fit into single node memory. Performance improvement can be expected, but it won’t scale linearly with the number of nodes, since the data needs to be merged by a single node.

-- set session distributed_sort = 'true'
    SELECT * FROM large_table, small_table
    WHERE small_table.id = large_table.id  

time_partitioning_range

The time_partitioning_range session property is supported as a magic comment.

-- set session time_partitioning_range = 'value`  
  
The value can be

  * none

    * no time partitioning

  * number with unit

    * allowed units are `h, d, mo, q, y` (hour, day, month, quarter, year)

    * ex) 12h, 2d, 1mo, 1q, 1y

# Deprecated Features

## Migrate Your Distributed Joins

The **distributed_join** session property has been removed.

Use **join_distribution_type** session property instead. The **join_distribution_type** system property accepts the following values:

  * PARTITIONED

  * BROADCAST

Deprecated code:

```sql
-- set session distributed_join = 'true'
    SELECT * FROM large_table, small_table
    WHERE small_table.id = large_table.id  
  
Updated code:

```sql
-- set session join_distribution_type = 'PARTITIONED'
    SELECT * 
    FROM large_table, small_table
    WHERE small_table.id = large_table.id  

Updating Presto

About Side-by-Side Environments

For your testing, Presto 0.205 and Presto 317 are available side-by-side to allow testing of the code before the upgrade.

Running Code in Presto 317 vs. Presto 0.205

During the transition from Presto 317 and Presto 0.205 the two releases will be available side-by-side.

You can control which Presto release runs your code by using the following query hint:

VersionQuery Hint
Presto 317-- @TD engine_version: 317
Presto 0.205-- @TD engine_version: 0.205

Query Syntax Fixes

Running queries that worked in the previous version might cause error messages. For each case, we show the message, a query that causes the error, and an example of how to rewrite the query.

Having too many OR predicate. Maximum 500 OR allowed

Original code that causes the error:

SELECT path FROM  ( 
     SELECT * FROM sample_datasets.www_access ) WHERE 
    path = 'foo1' OR path = 'foo2' OR path = 'foo3'
    -- Query having a lot of "OR" clauses
    OR path = 'foo559';  

Modified code that fixes the error:

SELECT path FROM ( 
     SELECT * FROM sample_datasets.www_access ) WHERE
    path IN (
     'foo1', 'foo2', 'foo3' ,
    -- Query having a lot of conditions
    'foo559');  

Typically, using an IN clause to rewrite your queries solves the issue.

Key not present in map: XXX

Original code that causes the error:

SELECT MAP(ARRAY [1, 3], ARRAY [2, 4])[5];  

Modified code that fixes the error:

SELECT element_at(MAP(ARRAY [1, 3], ARRAY [2, 4]), 5);  

You can use the function element_at to rewrite your queries. The function element_at is supported by both cluster versions, v0.205, and v317. The preceding query runs on both clusters.

Array subscript out of bounds

Original code that causes the error:

SELECT
     numbers[99] -- key isn't exist
    FROM
     (VALUES (ARRAY[1,2,3]) ) AS t(numbers);  

Modified code that fixes the error:

SELECT
      element_at(numbers, 99) -- key isn't exist
    FROM
      (VALUES (ARRAY[1,2,3]) ) AS t(numbers);  

The function element_at is supported by both cluster versions, v0.205, and v317.

The preceding query runs on both clusters.

Column alias list has 1 entry but 't' has N columns available'

Original code that causes the error:

WITH
    dataset AS (
     SELECT ARRAY[
     CAST(ROW('Amy', 'devops') AS ROW(name VARCHAR, department VARCHAR))
     ] AS users
    ),
    u AS (
     SELECT person
     FROM
     dataset,
     UNNEST(dataset.users) AS t(person)
     )
    SELECT
     person.name,
     person.department
    FROM u;  

Modified code that fixes the error:

WITH
    dataset AS (
     SELECT ARRAY[
     CAST(ROW('Amy', 'devops') AS ROW(name VARCHAR, department VARCHAR))
     ] AS users
    ),
    u AS (
     SELECT
     name, department
     FROM
     dataset,
     UNNEST(dataset.users) AS person(name, department)
     )
    SELECT
     name,
     department
    FROM u;  

Window does not support distinct

Original code that causes the error:

SELECT 
      UPPER(DISTINCT(id)) AS user_id 
    FROM (VALUES ('foo'), ('bar')) AS t(id);  

Modified code that fixes the error:

SELECT 
      UPPER(id) AS user_id 
    FROM (
      SELECT DISTINCT id FROM (VALUES ('foo'), ('bar')
    ) AS t (id));  

ARRAY comparison not supported for arrays with null elements

Original code that causes the error:

SELECT id, array_agg(value)
    FROM (
     VALUES
     (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, null)
    ) AS t(id, value)
    GROUP BY 1 ORDER BY 2;  

Modified code that fixes the error:

SELECT id, array_agg(value) FILTER (where value is not null)
    FROM (
     VALUES
     (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, null)
    ) AS t(id, value)
    GROUP BY 1 ORDER BY 2;  

The function element_at is supported by both cluster versions, v0.205, and v317. The preceding query runs on both clusters.

Details of this error: Behavior change of array_agg function

For Presto 0.205, the array_agg function ignores NULL values.

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null), (1, 0) ) AS t(id, value) GROUP BY 1;
     id | _col1
    ----+-------
     1 | [0]
    (1 row)
    
    presto> SELECT id, array_agg(value) FROM ( VALUES (1, null) ) AS t(id, value) GROUP BY 1;
     id | _col1
    ----+-------
     1 | NULL
    (1 row)  
  
In Presto 0.205, _deprecated_._legacyagg_ property, true is enabled in order to keep this legacy behavior of the array_agg function.

This property was removed in Presto 317 and there is no ability to roll back the legacy behavior.

For Presto 317, the array_agg function doesn't ignore NULL values and returns an array including NULL values.

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null), (1, 0) ) AS t(id, value) GROUP BY 1; id | _col1 ----+----------- 1 | [null, 0] (1 row)

presto> SELECT id, array_agg(value) FROM ( VALUES (1, null) ) AS t(id, value) GROUP BY 1; id | _col1 ----+-------- 1 | [null] (1 row)

As a result, when Presto compares with a column that has arrays with NULL values, Presto returns the ARRAY comparison not supported for arrays with null elements message.

'XXX' must be an aggregate expression or appear in a GROUP BY clause

Original code that causes the error:

SELECT code FROM sample_datasets.www_access 
    HAVING code IS NOT NULL;  

Modified code that fixes the error:

SELECT code FROM sample_datasets.www_access 
    WHERE code IS NOT NULL;  

If you use a HAVING clause as a WHERE clause, this error appears.

Zero-length delimited identifier not allowed

Original code that causes the error:

SELECT 1 AS "";  

Modified code that fixes the error:

SELECT 1 AS "A";  

Unexpected parameters (char(1)) for function codepoint. Expected: codepoint(varchar(1))

Original code that causes the error:

SELECT codepoint(CAST('a' AS CHAR));  

Modified code that fixes the error:

SELECT codepoint(CAST('a' AS VARCHAR(1)));  

Invalid format: "XXX" is malformed / Invalid timezone: "XXX" is malformed

Error handling has been improved in v317. When an invalid argument gets passed, the current Presto returns NULL, but the new Presto returns the error.

Original code that causes the error:

SELECT td_time_range(1567890000, 'yyyy-MM-dd','JST', NULL);  

Modified code that fixes the error:

SELECT td_time_range(1567890000, '2019-09-08', 'JST', NULL);