This document summarizes new features and changes in Presto 350.
Performance Improvements
Here is a list of updates related to performance improvements:
Performance improvements in querying information_schema tables (#999, #1306, #1329, #1543, #2488)
Pushdown support for some functions and operators in PostgreSQL (DataTank) connector (#3881, #4111, #4112, #5261)
Improved query performance by reducing worker-to-worker communication overhead. (#6126, #5905, #5949)
Improved performance of ORDER BY ... LIMIT queries. (#6072)
Improved performance of queries with uncorrelated IN clauses. (#5582)
Improved performance of queries that use the decimal type. (#4730, #4886, #5181)
Improved performance and accuracy of approx_percentile(). (#5158)
Improved performance of certain cross join queries. (#5276)
Reduced latency for queries that perform a broadcast join of a large table. (#5237)
Improved performance of queries involving comparisons between DOUBLE or REAL values and integer values. (#3533)
Improved performance of queries involving row_number(). (#3614)
Improved performance of queries containing LIKE predicate. (#3618)
Improved performance for queries that read fields from nested structures. (#2672)
Improved performance of queries involving constant scalar subqueries (#3432)
Improved query performance by removing redundant data reshuffling. (#2853)
Improved performance of inequality joins involving BETWEEN. (#2859)
Improved join performance for dictionary encoded data. (#2862)
Improved performance of queries containing redundant scalar subqueries. (#2456)
Improved performance of INSERT and CREATE TABLE ... AS queries containing redundant ORDER BY clauses. (#2044)
Improved performance when processing columns of map type. (#2015)
Reduced query memory usage by improving retained size estimation for VARCHAR and CHAR types. (#4123)
Improved performance of certain join queries by reducing the amount of data that needs to be scanned. (#1673)
Improved performance of queries containing complex predicates. (#1515)
Improved performance of certain window functions when using bounded window frames (e.g., ROWS BETWEEN ... PRECEDING AND ... FOLLOWING). (#464)
Improved performance of certain queries involving coercions and complex expressions in JOIN conditions. (#1390)
New UDFs
Many new UDFs are available. See the official Trino documentation for the usage of each function; a few illustrative examples follow the list below.
contains_sequence()
concat_ws()
murmur3()
from_unixtime_nanos()
T-Digest functions (see the "T-Digest functions" page in the Trino 361 documentation)
from_iso8601_timestamp_nanos()
human_readable_seconds()
bitwise_left_shift(), bitwise_right_shift() and bitwise_right_shift_arithmetic()
luhn_check()
approx_most_frequent()
random(m, n)
starts_with()
regexp_count(), regexp_position()
strpos(string, substring, instance)
Geospatial functions: to_encoded_polyline(), from_encoded_polyline(), line_interpolate_point(), line_interpolate_points(), geometry_from_hadoop_shape(), ST_Length(SphericalGeography)
at_timezone(), with_timezone()
last_day_of_month()
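For example, a few of the new functions can be called as follows (the input values are illustrative):
SELECT
  concat_ws(',', 'alice', 'bob'),          -- 'alice,bob'
  starts_with('presto350', 'presto'),      -- true
  last_day_of_month(DATE '2021-02-10')     -- 2021-02-28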
New SQL support
Add IF EXISTS and IF NOT EXISTS syntax to ALTER TABLE. (#4651)
Add support for INTERSECT ALL and EXCEPT ALL. (#2152)
Add support for DISTINCT clause in aggregations within correlated subqueries. (#5904)
Add support for RANGE BETWEEN <value> PRECEDING AND <value> FOLLOWING window frames. (#609)
Add support for window frames based on GROUPS. (#5713) (See the example after this list.)
Add support for extract() with TIMEZONE_HOUR and TIMEZONE_MINUTE for time with time zone values. (#5668)
Add support for correlated subqueries in recursive queries. (#4877)
Add support for IN predicate with subqueries in outer join condition. (#4151)
Add support for quantified comparisons (e.g., > ALL (...)) in aggregation queries. (#4128)
Add support for variable-precision TIME type. (#4381)
Add support for variable-precision TIME WITH TIME ZONE type. (#4905)
Add support for variable-precision TIMESTAMP (without time zone) type. (#3783)
Add support for variable-precision TIMESTAMP WITH TIME ZONE type. (#3947)
Allow inserting values of a larger type into a smaller type when the values fit. For example, BIGINT into SMALLINT, or VARCHAR(10) into VARCHAR(3). Values that don’t fit will cause an error at runtime. (#2061)
Allow using .* on expressions of type ROW in the SELECT clause to convert the fields of a row into multiple columns. (#1017)
Allow references to tables in the enclosing query when using .*. (#1867)
Add support for IGNORE NULLS for window functions. (#1244)
Add support for INNER and OUTER joins involving UNNEST. (#1522)
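As an illustration of two of these additions (GROUPS-based window frames and INTERSECT ALL), the following sketches use the sample_datasets.www_access table referenced later in this document; the columns and results are for illustration only.
-- Window frame defined in terms of peer groups (GROUPS)
SELECT
  time,
  path,
  COUNT(*) OVER (ORDER BY time GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS nearby_rows
FROM sample_datasets.www_access
-- INTERSECT ALL keeps duplicate rows, unlike INTERSECT, which removes them
SELECT path FROM sample_datasets.www_access
INTERSECT ALL
SELECT path FROM sample_datasets.www_access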
JDBC Improvements
Query parameters are supported in LIMIT, OFFSET and FETCH FIRST clauses.
Connection conn = ...
PreparedStatement stmt = conn.prepareStatement(
    "SELECT * FROM sample_datasets.www_access OFFSET ? LIMIT ?");
stmt.setInt(1, 10); // OFFSET = 10
stmt.setInt(2, 20); // LIMIT = 20
ResultSet rs = stmt.executeQuery();
...
time_partitioning_shuffle magic comment for INSERT/CTAS
Treasure Data supports a TD-specific setting to customize the partition size of INSERT/CTAS queries for better query performance. The following is an example of the time_partitioning_range session property, which is supported as a magic comment.
-- set session time_partitioning_range = '12h'
This comment can also be used to disable shuffling by specifying no-shuffle as a value. In this case, however, the partition size cannot be customized. In Presto 350, we introduced a new magic comment, time_partitioning_shuffle = 'true'|'false', so that you can specify partition shuffling and time partition ranges independently.
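For illustration, assuming the new comment uses the same -- set session form as time_partitioning_range (the destination table name below is hypothetical), the two settings could be combined at the top of a query as follows:
-- set session time_partitioning_shuffle = 'true'
-- set session time_partitioning_range = '12h'
INSERT INTO my_destination_table
SELECT * FROM sample_datasets.www_access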
Migration Guide
Compatibility Changes Required
While most existing Presto 317 queries are compatible with the new version, Presto 350, some queries require syntax changes or other adjustments.
Run the New Presto Version
You can specify the Presto version that runs your queries by using the following query hints:
(i) Presto 350: -- @TD engine_version: 350
(ii) Presto 317: -- @TD engine_version: 317
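For example, to run a single query on Presto 350, place the hint at the top of the query:
-- @TD engine_version: 350
SELECT COUNT(1) FROM sample_datasets.www_access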
You can request Treasure Data Support to change the default engine version to Presto 350.
Next Steps
We plan to start migrating the Presto default version in the near future. The target date will be announced soon, but we recommend you verify your Presto queries and switch to the new version as soon as possible.
When we announce the schedule, we plan to use our simulation to provide the executing users with a list of potentially incompatible Presto queries.
If you have any further inquiries, please reach out to support@treasure-data.com or your Customer Success representative.
Changed Presto Queries
SELECT DISTINCT, ORDER BY
When SELECT DISTINCT is used with an ORDER BY clause, the ORDER BY expressions must appear in the select list.
The following query works in the current version of Presto.
SELECT DISTINCT
  from_unixtime(time),
  COUNT(*),
  time
FROM sample_datasets.www_access
GROUP BY from_unixtime(time), time
ORDER BY from_unixtime(time)
This query fails on Presto 350 with an error message like For SELECT DISTINCT, ORDER BY expressions must appear in select list. To solve this issue on Presto 350, rewrite this query as follows.
SELECT DISTINCT
  from_unixtime(time),
  COUNT(*),
  time
FROM sample_datasets.www_access
GROUP BY from_unixtime(time), time
ORDER BY from_unixtime(sample_datasets.www_access.time)
lag() and lead()
ORDER BY
lag() and lead() require ORDER BY in Presto 350. For example, the following query works on the current version, but it doesn’t on Presto 350.
SELECT
  time,
  LAG(path, 1) OVER () AS lag_data,
  path,
  LEAD(path, 1) OVER () AS lead_data
FROM sample_datasets.www_access
To support this query in Presto 350, you must explicitly use the ORDER BY clause as follows.
SELECT
  time,
  LAG(path, 1) OVER (ORDER BY time) AS lag_data,
  path,
  LEAD(path, 1) OVER (ORDER BY time) AS lead_data
FROM sample_datasets.www_access
Window Frame
Presto 350 has a new semantic check for lag() and lead(): a window frame cannot be specified. For example, the following query works on the current version, although it doesn't produce the expected results; on Presto 350 it fails.
SELECT
  time,
  user,
  LAG(path, 1) OVER (ORDER BY time ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS lag_data,
  path,
  LEAD(path, 1) OVER (ORDER BY time ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS lead_data
FROM sample_datasets.www_access
To fix this query in Presto 350, remove ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING.
SELECT
  time,
  user,
  LAG(path, 1) OVER (ORDER BY time) AS lag_data,
  path,
  LEAD(path, 1) OVER (ORDER BY time) AS lead_data
FROM sample_datasets.www_access
TIME and TIMESTAMP Behavior Changes
In Presto 350, TIME and TIMESTAMP behavior follows the SQL standard, which introduces some incompatibilities with the current version.
Presto has the following data types to represent time and timestamp:
TIME
TIME WITH TIME ZONE
TIMESTAMP
TIMESTAMP WITH TIME ZONE
In the current version, TIME and TIMESTAMP behave similarly to their WITH TIME ZONE counterparts, but in Presto 350, TIME and TIMESTAMP don't have an associated time zone. This might affect query results.
Examples
The result of the following query will be different depending on the Presto version:
SELECT CAST('2000-01-01 00:00:00 US/Eastern' AS TIMESTAMP)
--> 2000-01-01 05:00:00.000 on Presto 317
--> 2000-01-01 00:00:00.000 on Presto 350
When you convert WITH TIME ZONE types to TIME or TIMESTAMP, the time zone is taken into account in the current version, but not in Presto 350. If you have queries that contain such conversions, you might need to revise them for Presto 350. In the previous example, the query can be rewritten as follows:
SELECT CAST(
  -- Convert to UTC before CAST as TIMESTAMP
  TIMESTAMP '2000-01-01 00:00:00 US/Eastern' AT TIME ZONE 'UTC'
  AS TIMESTAMP)
or simply use TIMESTAMP WITH TIME ZONE type instead:
SELECT CAST('2000-01-01 00:00:00 US/Eastern' AS TIMESTAMP WITH TIME ZONE)
In addition, political timezones (e.g. America/Los_Angeles) are no longer allowed in TIME WITH TIME ZONE, to avoid issues around DST and possible future policy changes.
Therefore, the following query works on the current version, but not on Presto 350.
SELECT TIME '01:02:03.456 America/Los_Angeles'
Rewrite this query as follows:
SELECT TIME '01:02:03.456 -07:00'
Also, time zone names are case-sensitive in Presto 350. For example, Asia/tokyo is invalid and has to be written as Asia/Tokyo on Presto 350.
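For example (the timestamp literal is illustrative):
SELECT TIMESTAMP '2021-01-01 00:00:00 Asia/tokyo' -- fails on Presto 350
SELECT TIMESTAMP '2021-01-01 00:00:00 Asia/Tokyo' -- works on Presto 350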
Query results can also differ. For example, SELECT CURRENT_TIME returns 09:29:52.540 UTC on the current version, but 09:29:52.540+00:00 on Presto 350.
Query Length Limitation
Presto 350 might generate longer bytecode internally, so long queries might hit the query length limitation. If you get error messages like the following, you need to shorten the query.
java.lang.IllegalArgumentException: bad parameter count 256
io.prestosql.spi.PrestoException: Query exceeded maximum columns
checksum()
The implementation of the checksum() function has changed in Presto 350, so checksum() generates different results from previous Presto versions.
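For example, the following query returns a different varbinary value on Presto 317 and Presto 350 for the same underlying data:
SELECT checksum(path) FROM sample_datasets.www_access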
approx_percentile()
approx_percentile() now uses T-digest as its internal data structure, which is more accurate and faster than the previous implementation. As a result, the function produces slightly different results.
However, if you specify the accuracy parameter, approx_percentile() doesn’t use T-digest because T-digest doesn’t allow the accuracy parameter. If you want to benefit from T-digest-based approx_percentile(), consider dropping the accuracy parameter.
-- Before
SELECT approx_percentile(
  column1 / column2, -- can be infinite value
  0.5                -- percentile
) FROM ...
-- Exclude invalid values
SELECT approx_percentile(
  column1 / column2, -- can be infinite value
  0.5                -- percentile
) FILTER (WHERE is_finite(column1 / column2)) FROM ...
-- Use old version
SELECT approx_percentile(
  column1 / column2, -- can be infinite value
  1,    -- weight
  0.5,  -- percentile
  0.01  -- accuracy
) FROM ...
double NaN to integer
Presto 350 doesn't allow converting a NaN value to INTEGER, whereas the current version converts it to 0 with CAST. The following query works on the current version but fails on Presto 350 with a Cannot cast double NaN to integer error message.
SELECT CAST(nan() AS INTEGER)
You can restore the original behavior by using TRY_CAST and COALESCE as follows:
SELECT COALESCE(TRY_CAST(nan() AS INTEGER), 0)
information_schema.columns
The comment and extra_info columns have been removed from information_schema.columns. If you have queries that refer to these columns, you have to revise them.
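For example, a query like the following fails on Presto 350 because the comment column no longer exists:
SELECT table_name, column_name, comment FROM information_schema.columns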
Here is a query result of information_schema.columns on Presto 350.
presto> select * from information_schema.columns limit 10;
 table_catalog |    table_schema    | table_name | column_name   | ordinal_position | column_default | is_nullable | data_type
---------------+--------------------+------------+---------------+------------------+----------------+-------------+-----------
 td-presto     | information_schema | tables     | table_catalog |                1 | NULL           | YES         | varchar
 td-presto     | information_schema | tables     | table_schema  |                2 | NULL           | YES         | varchar
...
Length check for CHAR(x) and VARCHAR(x)
A length check for CHAR(x) and VARCHAR(x) has been introduced in Presto 350. Extra characters are simply truncated from the results, but you might see the following error in some cases.
io.prestosql.spi.PrestoException: Could not serialize column 'col1' of type 'varchar(10)' at position 1:2
You might see this error if your query contains CAST(xxx AS VARCHAR(10)) AS col1 and the col1 value is longer than 10 characters. In this case, you need to revise the query to limit the value length, or drop the length constraint by using CAST(xxx AS VARCHAR).
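For example (the column choice is illustrative):
-- Might hit the error above in some cases when path is longer than 10 characters
SELECT CAST(path AS VARCHAR(10)) AS col1 FROM sample_datasets.www_access
-- Dropping the declared length avoids the error
SELECT CAST(path AS VARCHAR) AS col1 FROM sample_datasets.www_access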
Trailing delimiter in ARRAY_JOIN() result
In the current version, ARRAY_JOIN() appends the delimiter if the last element is NULL, but this is not the case in Presto 350. For example, the following query yields "A,B," on the current Presto version, but "A,B" on Presto 350. The ending comma (,) disappears on Presto 350.
SELECT ARRAY_JOIN(ARRAY ['A', 'B', NULL], ',')
Invalid value error at TRY(CAST(VARCHAR as TIMESTAMP))
The following query yields NULL on Presto 317 because, although the timestamp format is valid, 32 is not a valid day of the month.
SELECT TRY(CAST('2022-01-32 00:00:00' AS TIMESTAMP))
You might see an error like this on Presto 350.
Invalid value for DayOfMonth (valid values 1 - 28/31): 32
Use TRY_CAST instead of the combination of TRY and CAST.
SELECT TRY_CAST('2022-01-32 00:00:00' AS TIMESTAMP)