
Presto Query FAQs

Note: Treasure Data specific tips are marked with TD.


Q: How can I migrate Hive queries into Presto?

See the migration guide from Hive. Presto follows ANSI SQL syntax and semantics, while Hive uses an SQL-like syntax.

  • A major difference is quotation: in Presto, use double quotes for column and table names (e.g., “column1”) and single quotes for strings (e.g., ‘hello presto’); see the example below.
  • Implicit type conversion also differs:
    • SELECT CAST(5 AS DOUBLE) / 2 gives 2.5, but SELECT 5 / 2 gives 2 in Presto (integer division).
    • In Hive, both return 2.5.
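
For example, a minimal sketch of Presto’s quoting rules (the table and column names are hypothetical):

SELECT "column1" FROM "table1" WHERE "column1" = 'hello presto'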

Q: What functions are available in Presto?

See the list of Presto functions. For Treasure Data specific UDFs, see Presto UDFs.
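
For example, TD_TIME_FORMAT, one of the Treasure Data UDFs used throughout this page, formats a unix timestamp in a given time zone (my_table is a hypothetical table):

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss', 'PDT') FROM my_table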

Q: My query is not working. How can I fix it?

Solution: Start with a smaller data set, then look into sub queries.

Presto can process millions of rows per second. If you see query errors or unexpected results, first try to minimize your input data set, since checking many rows by hand is difficult. Here is a guideline for fixing your query:

  1. Narrow down the input data size by using TD_TIME_RANGE. TD
  2. Minimize your query:
    • Remove irrelevant columns from the SELECT statements.
    • Reduce the number of joins.
    • Replace SELECT c1, c2, … with SELECT count(*).
      • If you are still experiencing query errors, skipping the query result generation with count(*) will reduce your investigation time.
  3. Extract sub-queries and check their results one by one.

    • Running individual sub queries is also useful to identify the performance bottleneck.
  4. Check the conditions in WHERE clause carefully.

    • A common pitfall is checking for NULL values with the equal (=) operator; a comparison with NULL never evaluates to true. Use IS NULL (or IS NOT NULL) instead.
    • Here is another example in which the mistake is difficult to notice. The following query scans the entire data set even though TD_TIME_RANGE is used:
SELECT * FROM table1
WHERE
  col1 < 100
  OR col2 is TRUE
  AND TD_TIME_RANGE(time, '2015-11-01')

This is because AND has higher precedence than OR, so this condition is equivalent to:

(col1 < 100) OR (col2 is TRUE AND TD_TIME_RANGE(time, '2015-11-01'))

The first condition does not specify any TD_TIME_RANGE, so it results in scanning the whole table. To fix this, use parentheses appropriately:

(col1 < 100 OR col2 is TRUE) AND TD_TIME_RANGE(time, '2015-11-01')

Solution: Create a sample data set with the VALUES clause.

To test Presto queries, you don’t need to import any data set. Just use the VALUES clause to prepare a sample data set:

SELECT * FROM (VALUES (1, 'apple'), (2, 'banana')) as fruit(id, name);

# This query gives the following table:
#  id |  name
#-----+--------
#   1 | apple
#   2 | banana

Solution: Ask Our Support Team TD

(For Treasure Data customers) If you still see any problem, ask our support. Our support engineers are experts in SQL and various types of data analysis. Send an e-mail to support@treasuredata.com with the job ID(s) of your queries. Describing the expected results and the meaning of your data set helps us give you a better answer.

Q: My query failed with a PAGE_TRANSPORT_TIMEOUT error. What does it mean?

See the answer below.

Q: How reliable is Presto?

Presto transfers the data through the network. Due to connection timeout or some problem in worker nodes, this network data transfer may fail occasionally (PAGE_TRANSPORT_TIMEOUT, WORKER_RESTARTED, etc.). Presto is designed to have faster query processing compared to Hive, so it sacrifices fault-tolerance to some extent. In reality, however, more than 99.5% of Presto queries finish without any error on the first run. In addition, Treasure Data provides a query retry mechanism on query failures, so nearly 100% of queries finish successfully after being retried.

Note that, however, because of the nature of network-based distributed query processing, if your query tries to process billions of records, the chance of hitting network failures will increase. If you see PAGE_TRANSPORT_TIMEOUT error frequently, try to reduce the input data size by narrowing down the TD_TIME_RANGE or reducing the number of columns in SELECT statements.

Q: My query is slow. How can I improve it?

Solution: Avoid single-node operations.

Presto is a distributed query engine, but some operators need to be processed in a single process. For example,

  • count(distinct x) — counts the number of unique values.
  • UNION — checks the uniqueness of tuples by comparing all column values.
    • In most cases UNION can be replaced with UNION ALL (no duplicate elimination).
  • ORDER BY — Presto sorts the entire data set on a single node.
    • If sorting millions of rows or more with ORDER BY is necessary, you can combine Hive and Presto; a distributed top-N alternative is also sketched after this list.
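
As a hedged sketch (the table and column names are hypothetical): Presto can typically evaluate ORDER BY combined with LIMIT as a distributed top-N operation, avoiding a full single-node sort:

SELECT user_id, count(*) AS cnt
FROM events
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 100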

Solution: Narrow down the table scan range with TD_TIME_RANGE. TD

Treasure Data is a time-series database that creates 1-hour partitioned buckets, so focusing on a specific time range of the data gives the best query performance. TD_TIME_RANGE is your friend:

SELECT ... FROM ...
WHERE TD_TIME_RANGE(time, '2015-10-01 PST', '2015-11-01 PST')

Solution: Avoid SELECT *; use CREATE TABLE AS or INSERT INTO instead.

See Q: Query that produces a huge result is slow.

Solution: Aggregate a series of LIKE clauses into one regexp_like expression.

Presto’s query optimizer is unable to improve queries that use many LIKE clauses, so query execution can be slower than expected. To improve performance, you can substitute a series of LIKE clauses chained with OR with a single regexp_like clause. For example:

SELECT
  ...
FROM
  access
WHERE
  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'

can be optimized by replacing the 4 LIKE clauses with a single regexp_like function:

SELECT
  ...
FROM
  access
WHERE
  regexp_like(method, 'GET|POST|PUT|DELETE')

Solution: Check 1-hour data distribution TD

If you have a huge number of rows in a 1-hour partition, processing this partition can be the performance bottleneck. To check the number of rows contained in each partition, run the following query:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH') hour, count(*) cnt
FROM table1
WHERE TD_TIME_RANGE(time, '2015-10-01 UTC', '2015-11-01 UTC')
GROUP BY 1
ORDER BY cnt DESC
LIMIT 100

This query shows the 100 partitions containing the highest number of records between 2015-10-01 and 2015-11-01.

Q: How can I improve the join performance?

Solution: Use simple equi-joins.

The equi-join concatenates tables by comparing join keys using the equal (=) operator. If this comparison becomes complex, the join processing slows down. For example, suppose you want to join two tables on a date string (e.g., ‘2015-10-01’), but one of the tables only has separate year, month, and day columns. You might write the following query to generate date strings:

SELECT a.date, b.name FROM
left_table a
JOIN right_table b
ON a.date = CAST((b.year * 10000 + b.month * 100 + b.day) as VARCHAR)

This query will delay the join processing since the join condition involves several expressions. You can make it faster by pushing down this condition into a sub query to prepare a join key beforehand:

SELECT a.date, b.name FROM
left_table a
JOIN (
  SELECT
    CAST((year * 10000 + month * 100 + day) as VARCHAR) date,  -- generate the join key
    name
  FROM right_table
) b
ON a.date = b.date  -- simple equi-join

In this example, the join keys are the a.date and b.date columns. Comparing two VARCHAR strings is much faster than comparing a VARCHAR with an expression result. In the future, Presto may be able to optimize this type of query, but for now you need to rewrite the query by hand.

Q: How can I simplify the query?

Solution: Use WITH statement.

If your query becomes complex or deeply nested, try extracting sub queries using a WITH clause. For example, the following query with a nested sub query:

SELECT a, b, c FROM (
   SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a
) tbl_alias

can be rewritten as follows:

WITH tbl_alias AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a)
SELECT a, b, c FROM tbl_alias

You can also enumerate multiple sub-queries in a WITH clause, separated by commas:

WITH tbl1 AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl GROUP BY a),
     tbl2 AS (SELECT a, AVG(d) AS d FROM another_tbl GROUP BY a)
SELECT tbl1.*, tbl2.* FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a

Solution: Use WITH statement in CREATE TABLE statement.

If your CREATE TABLE query becomes complex or deeply nested, try extracting sub queries using a WITH clause. For example, a sub query can be rewritten as follows:

CREATE TABLE tbl_new AS WITH tbl_alias AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl1)
SELECT a, b, c FROM tbl_alias

You can also enumerate multiple sub-queries in the WITH clause as follows:

CREATE TABLE tbl_new AS WITH tbl_alias1 AS (SELECT a, MAX(b) AS b, MIN(c) AS c FROM tbl1),
                             tbl_alias2 AS (SELECT a, AVG(d) AS d FROM tbl2)
SELECT tbl_alias1.*, tbl_alias2.* FROM tbl_alias1 JOIN tbl_alias2 ON tbl_alias1.a = tbl_alias2.a

Solution: Specify GROUP BY targets with numbers.

The GROUP BY clause requires repeating the same expression that appears in the SELECT statement:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT') hour, count(*) cnt
FROM my_table
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT')  -- redundant expression

You can simplify this query by using GROUP BY 1, 2, …:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'PDT') hour, count(*) cnt
FROM my_table
GROUP BY 1

These numbers correspond to the column indexes (1-based) of the SELECT statement.

Q: Exceeded max (local) memory xxGB error.

Answer: Rewrite your SQL to use less memory.

Presto tracks the memory usage of each query. While the available memory varies according to your price plan, in most cases it is possible to rewrite your query to resolve this issue. Here is a list of memory-intensive operations:

  • distinct
  • UNION
  • ORDER BY
  • GROUP BY (of many columns)
  • joins

Solution: Avoid using the distinct operator.

distinct eliminates duplicate rows. For example, the following query returns the distinct set of (c1, c2, c3) tuples contained in your table:

SELECT distinct c1, c2, c3 FROM my_table

This stores the entire set of c1, c2, and c3 values in the memory of a single worker node to check the uniqueness of the tuples. The required memory increases with the number and size of the columns. Remove distinct from your query, or apply it after reducing the number of input rows with a sub query, as sketched below.
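
A minimal sketch of the sub query approach, assuming the table has a time column usable with TD_TIME_RANGE:

SELECT distinct c1, c2, c3
FROM (
  SELECT c1, c2, c3
  FROM my_table
  WHERE TD_TIME_RANGE(time, '2015-10-01', '2015-10-02')
)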

Solution: Replace count(distinct x) with approx_distinct(x).

If you are counting the number of distinct users, events, etc., you would usually use a count(distinct id) operation, but this can cause memory issues:

SELECT count(distinct id) FROM my_table

To reduce the memory usage, use an approximate version of count(distinct x):

SELECT approx_distinct(id) FROM my_table

approx_distinct(x) returns an approximation of the true value; the probability of returning a result far from the true value is quite low. If you only need a rough picture of the characteristics of your data set, use this approximate version.

Solution: Prefer UNION ALL over UNION.

For the same reason as distinct, UNION performs duplicate elimination and requires a substantial amount of memory:

SELECT c1, c2, c3 FROM my_table1
UNION
SELECT c1, c2, c3 FROM my_table2

If you use UNION ALL, you can avoid the duplicate elimination:

SELECT c1, c2, c3 FROM my_table1
UNION ALL
SELECT c1, c2, c3 FROM my_table2

This requires less memory and is faster. If you need to concatenate two or more SQL query results, use UNION ALL.

Solution: Avoid ORDER BY.

SELECT c1, c2 FROM my_table ORDER BY c1

Presto performs sorting using a single worker node. So the entire data set must fit within the memory limit of a worker (usually less than 5GB).

If you are sorting a small number of rows (e.g., ~10,000), using ORDER BY is fine, but if you are going to sort GBs of data, you need an alternative strategy. If sorting the entire data set is necessary, you can combine Hive and Presto.

First store the results of your Presto query using a CREATE TABLE AS or INSERT INTO query, then use Hive to sort the data set, as in the sketch below.
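
A minimal sketch of this two-step approach (table and column names are hypothetical; the second statement runs as a separate Hive job):

-- Step 1 (Presto): materialize the unsorted result
DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS SELECT c1, c2 FROM my_table;

-- Step 2 (Hive): sort the materialized table
SELECT c1, c2 FROM my_result ORDER BY c1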

Solution: Reduce the number of GROUP BY columns.

SELECT avg(c1), min_by(c2, time), max(c3), count(c4), ...
FROM my_table
GROUP BY c1, c2, c3, c4, ...

If you enumerate many target columns in a GROUP BY clause, storing the set of keys (c1, c2, c3, c4, …) requires a lot of memory. Reducing the number of columns in the GROUP BY clause reduces the memory usage; one approach is sketched below.
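
As a hedged sketch: if the extra columns are functionally determined by the grouping key, they can be folded into aggregates such as arbitrary() instead of being listed in GROUP BY (column names are hypothetical):

SELECT c1, arbitrary(c2) AS c2, avg(c3) AS c3_avg
FROM my_table
GROUP BY c1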

Solution: Reduce the size of GROUP BY keys with smart_digest(). TD

We provide the smart_digest(key) UDF for Treasure Data customers, which creates smaller hash values to reduce the size of keys:

SELECT smart_digest(path), count(*)
FROM www_access
GROUP BY smart_digest(path)

Solution: Join from the largest table to smaller tables.

Queries that put a small table first in a join clause often cause Presto to run up against its memory limits:

SELECT * FROM small_table, large_table
WHERE small_table.id = large_table.id

Presto performs a broadcast join by default, which partitions the left-hand side table across several worker nodes, then sends an entire copy of the right-hand side table to the worker nodes that have a partition. If the right-hand side table is large and doesn’t fit in memory on a worker node, it causes an error.

Reordering the join so that the largest table comes first solves the issue:

SELECT * FROM large_table, small_table
WHERE large_table.id = small_table.id

This query distributes the left table (large_table), greatly reducing the chance of hitting the memory limit.

Solution: Use distributed joins. TD

If your query still doesn’t work, try a distributed join by adding a magic comment that sets a session property:

-- set session distributed_join = 'true'
SELECT * FROM large_table, small_table
WHERE small_table.id = large_table.id

The distributed join algorithm partitions both the left and right-hand side tables using hash values of the join key(s) as the partitioning key, so it works even if the right-hand side table is large. The downside is that it increases the number of network data transfers and is usually slower than a broadcast join.

Q: Query that produces a huge result is slow.

Answer: Consider using CREATE TABLE AS or INSERT INTO.

SELECT * FROM my_table

Presto uses JSON text to materialize query results. If the above table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result. So even if the query computation is almost finished, outputting the JSON results takes a long time.

Solution: Use CREATE TABLE (table) AS SELECT … TD

DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS SELECT * FROM my_table;

You can parallelize the query result output process by using a CREATE TABLE AS SELECT statement. To properly clean up the result table beforehand, add a DROP TABLE statement at the top of your query. The result output can be 5x or more faster than running SELECT *: our Presto skips the JSON output process and directly produces a 1-hour partitioned table.

Solution: Use INSERT INTO (table) SELECT … TD

You can also use INSERT INTO (table) SELECT … to append data to an existing table. This also improves the query result output performance:

CREATE TABLE IF NOT EXISTS my_result(time bigint);
INSERT INTO my_result SELECT * FROM my_table;

Note that if the subsequent SELECT statement does not produce the time column, INSERT INTO attaches the query execution time as the time column values, so you can find the inserted rows by using the query execution time, as in the sketch below.
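
A minimal sketch (the timestamp below is a hypothetical query execution time):

-- If the INSERT INTO job ran at unix time 1449878400,
-- the rows added by that job carry this value in the time column
SELECT * FROM my_result WHERE time = 1449878400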

Solution: Avoid using SELECT *.

SELECT * FROM my_table

Treasure Data uses a column-oriented storage format, so accessing a small set of columns is very fast, but query performance deteriorates as the number of columns in your query increases. Be selective in choosing columns:

SELECT id, name, address FROM my_table

Solution: [Experimental] Use result_output_redirect TD

(This is an experimental feature that is subject to change in the future.) By setting result_output_redirect='true' in a magic comment, you can speed up the query output:

-- set session result_output_redirect='true'
SELECT * FROM my_table

With this query hint our Presto produces the query results in parallel and can skip the JSON conversion process at the coordinator.

However, if an ORDER BY clause or an uppercase field name is used in the query, this magic comment is ignored.

Q: How do I report daily statistics as columns in a row?

Sometimes you may need to report access counts for different date ranges as columns in a row. For example:

SELECT id, d1.total AS day1_total, d2.total AS day2_total
FROM (
 SELECT id, count(*) total
 FROM my_table
 WHERE TD_TIME_RANGE(time,
        TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'),
        TD_SCHEDULED_TIME(), 'UTC')
 GROUP BY id
) d1
LEFT JOIN (
 SELECT id, count(*) total
 FROM my_table
 WHERE TD_TIME_RANGE(time,
        TD_TIME_ADD(TD_SCHEDULED_TIME(), '-2d'),
        TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), 'UTC')
 GROUP BY id
) d2
ON d1.id = d2.id

The result of this query would be:

id day1_total day2_total
1 10 13
2 14 3

However, if your input table is huge, this query becomes inefficient since it involves joins and scans the same table multiple times.

Solution: Create a sparse table.

A more efficient approach is to create a sparse table like the following in a single table scan, instead of using joins:

SELECT id,
       CASE diff WHEN 0 THEN 1 ELSE 0 END AS day1,
       CASE diff WHEN 1 THEN 1 ELSE 0 END AS day2
FROM (
  SELECT id,
  date_diff('day',
            date_trunc('day', from_unixtime(time)),
            date_trunc('day', from_unixtime(TD_SCHEDULED_TIME()))) AS diff
  FROM my_table
  WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-2d'), TD_SCHEDULED_TIME(), 'UTC')
)

This produces a sparse table:

id day1 day2
1 1 0
2 1 0
1 1 0
1 0 1
2 0 1
2 0 1

Then, aggregate the result:

SELECT id,
       sum(day1) AS day1_total,
       sum(day2) AS day2_total
FROM sparse_table
GROUP BY id

For readability, you can write these steps in a single job by using a WITH statement:

WITH
-- Compute the date differences of the events from the TD_SCHEDULED_TIME
date_diff_table AS (
  SELECT id,
  date_diff('day',
            date_trunc('day', from_unixtime(time)),
            date_trunc('day', from_unixtime(TD_SCHEDULED_TIME()))) AS diff
  FROM my_table
  WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-2d'), TD_SCHEDULED_TIME(), 'UTC')
),
-- Create a sparse table, which maps daily event counts into day1, day2, ...
sparse_table AS (
  SELECT id,
         CASE diff WHEN 0 THEN 1 ELSE 0 END AS day1,
         CASE diff WHEN 1 THEN 1 ELSE 0 END AS day2
  FROM date_diff_table
)
-- Aggregate the sparse table
SELECT id,
       sum(day1) AS day1_total,
       sum(day2) AS day2_total
FROM sparse_table
GROUP BY id

Q: How can I compute median in Presto?

Solution: Use approx_percentile(column, 0.5).

Presto does not provide a median function, which would return the middle value of the sorted list. Instead, you can use the approx_percentile function:

SELECT approx_percentile(price, 0.5) FROM nasdaq
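
approx_percentile also accepts an array of percentiles, so a sketch like the following returns the quartiles in a single pass:

SELECT approx_percentile(price, ARRAY[0.25, 0.5, 0.75]) FROM nasdaq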

Q: How can I clean up a table before updating it?

Solution: Use multiple statements.

Presto supports executing multiple statements, separated by semicolons (;):

DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table AS SELECT ...;

Note that multiple statements are not executed as a single transaction. See also the limitations.

Q: How can I concatenate strings?

Solution: Use the || operator.

SELECT 'hello ' || 'presto'

This returns the string ‘hello presto’.
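
Note that || concatenates strings; non-string values need an explicit cast first (id and table1 are hypothetical):

SELECT 'id: ' || CAST(id AS VARCHAR) FROM table1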

Q: How can I supply a default value if the column contains NULLs?

Solution: Use the COALESCE function.

COALESCE(v1, v2, …) returns the first non-null value among v1, v2, …:

-- This returns 'N/A' if the name value is null
SELECT COALESCE(name, 'N/A') FROM table1

Q: How can I take the max/min of two values?

Solution: Use the greatest (or least) function.

SELECT greatest(5, 10) -- returns 10

Q: How can I parse JSON data?

Solution: Use the json_extract_scalar function.

Presto has several functions for reading JSON strings. To extract a JSON parameter, you can use the json_extract_scalar function:

-- This returns 1 (bigint)
SELECT CAST(json_extract_scalar('{"id":1, "name":"Ann"}', '$.id') AS BIGINT) id

Q: How can I parse JSON array data?

Solution: Use json_parse + CAST to ARRAY

For parsing nested JSON data, you can combine json_parse and cast functions:

SELECT * FROM (
  SELECT CAST(json_parse('[1, 2, 3]') AS ARRAY<BIGINT>) items
) CROSS JOIN UNNEST(items) AS t(item);     

This produces:

items item
[1, 2, 3] 1
[1, 2, 3] 2
[1, 2, 3] 3

You can also parse an array of JSON objects:

WITH records AS (
  SELECT CAST(json_parse('[{"id":1,"name":"Ann"},{"id":2, "name":"Bob"}]') 
    AS ARRAY<JSON>) records
)
SELECT record FROM records CROSS JOIN UNNEST(records) AS t(record);

This query returns each JSON object in a separate row:

record
{"id":1,"name":"Ann"}
{"id":2,"name":"Bob"}

Then you can use json_extract_scalar to read the parameters of each JSON object:

WITH records AS (
  SELECT CAST(json_parse('[{"id":1,"name":"Ann"},{"id":2, "name":"Bob"}]') AS ARRAY<JSON>) records
)
SELECT
   CAST(json_extract_scalar(record, '$.id') AS BIGINT) id,
   CAST(json_extract_scalar(record, '$.name') AS VARCHAR) name
FROM records CROSS JOIN UNNEST(records) AS t(record);

This returns:

id name
1 Ann
2 Bob

Q: How can I specify JSON keys with white spaces or special symbols?

Solution: Use JSON paths with square brackets and double quotes.

SELECT json_extract('{"a b":1}', '$["a b"]');

This returns 1.

Q: My query failed with a “Compiler failed and interpreter is disabled” error. What does it mean?

Too many OR and AND conditions in the WHERE clause (more than 1,000 in a job) can make the generation of the query execution bytecode fail. The only solution is to reduce the number of conditions, for example through UNION, an intermediate table, or an IN list, as sketched below.
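
As a hedged sketch (the column and values are hypothetical), a long chain of equality ORs on one column can often be collapsed into a single IN list:

-- Instead of: WHERE c1 = 'v1' OR c1 = 'v2' OR c1 = 'v3' OR ...
SELECT * FROM my_table WHERE c1 IN ('v1', 'v2', 'v3')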

Q: How can I use binary functions?

Presto supports binary functions. To use binary functions on Treasure Data, you need to convert binary/varbinary values to the varchar type.

Solution: Convert to SHA256

SELECT to_hex(sha256(to_utf8('support@treasure-data.com'))) as email
=> "D1F9E365DED3996161145F1FE41A1BE56365E314795DFB07F7B930A6BFF7248E"

Solution: Convert from/to base64

SELECT to_base64(to_utf8('support@treasure-data.com')) as email
=> "c3VwcG9ydEB0cmVhc3VyZS1kYXRhLmNvbQ=="
SELECT FROM_UTF8(from_base64('c3VwcG9ydEB0cmVhc3VyZS1kYXRhLmNvbQ=='))
=> "support@treasure-data.com"

Q: How do I use Presto geo-spatial functions?

Solution: How to store geo-spatial data.

The building block of geo-spatial functions is the Geometry data type, which is expressed using the WKT format. Well-known text (WKT) is a text markup language for representing vector geometry objects on a map. Examples include:

POINT (0 0)
LINESTRING (0 0, 1 1, 1 2)
POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))
MULTIPOINT (0 0, 1 2)
MULTILINESTRING ((0 0, 1 1, 1 2), (2 3, 3 2, 5 4))
MULTIPOLYGON (((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1)), ((-1 -1, -1 -2, -2 -2, -2 -1, -1 -1)))
GEOMETRYCOLLECTION (POINT (2 3), LINESTRING (2 3, 3 4))

As WKT is a text format, Geometry data types are stored as VARCHAR or STRING (excluding the Point data type, which can be stored as DOUBLE values representing the latitude and longitude). For example, to store a polygon in a table, you store the WKT representation of the polygon; the actual text stored in the TD table would be ‘POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))’. If you needed to store a multipoint line, the actual text stored in the TD table would be ‘MULTIPOINT (0 0, 1 2)’, as in the sketch below.
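
A minimal sketch of storing a WKT polygon in a plain VARCHAR column (the table name, city, and coordinates are illustrative):

CREATE TABLE cities AS
SELECT 'Dallas' AS city,
       'POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))' AS area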

Solution: How to convert string data to Geometry type?

All Presto geo-spatial functions require the Geometry data type. As mentioned in the previous solution, the data is stored in TD as a WKT-formatted string, so the WKT string data has to be converted into a Geometry type before it can be used in geo-spatial functions. The conversion is done using geo-spatial constructor functions, which accept WKT-formatted text as arguments and return the Geometry data type required by geo-spatial functions. Examples of constructor functions include:

ST_Polygon(varchar) → Polygon. Example: ST_Polygon('POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0), (1 1, 2 1, 2 2, 1 2, 1 1))') accepts the WKT text representation and returns a polygon geometry data type.

ST_GeometryFromText(varchar) → Geometry. Example: ST_GeometryFromText('MULTILINESTRING ((0 0, 1 1, 1 2), (2 3, 3 2, 5 4))') accepts the WKT text representation and returns a multilinestring geometry data type.

Solution: How to run geospatial queries in Presto.

To bring it all together, here is a simple example. Assume you have a database with two tables: Cities and Trips. The Cities table has columns city and area: city is a string (e.g., Dallas), and area is a WKT string representation of the city’s area, e.g., ‘POLYGON ((0.3528644 32.5869589, 0.3587434 32.5963149, 0.3353124 32.6015929, 0.3528644 32.5869589))’. The second table, Trips, is a list of bus rides taken. Its columns are username, a string of a user’s name (e.g., “Peter”), and journey, a WKT string representation of a linestring for each trip taken, e.g., ‘LINESTRING (0 0, 1 1, 1 2)’.

Given that setup, assume you want to find how many trips passed through certain cities. The geo-spatial function for this is:

ST_Intersects(Geometry, Geometry) → boolean

An example query would be:

SELECT c.city AS City, count(*) as NumTrips
FROM trips as t
JOIN cities as c
on  ST_Intersects(ST_Polygon(c.area), ST_LineFromText(t.journey))
GROUP BY 1

The output would be:

City NumTrips
Ames 10
New York 11
Dallas 3
Vegas 44

Solution: How to calculate the distance in kilometers between two points using latitude and longitude.

Due to the current implementation of geo-spatial functions in Presto, ST_Distance returns a distance in spatial-reference units, which can be impractical when you need the distance between two points in kilometers. In that case you can compute the distance with a plain SQL query. Assuming two locations A(lat1, long1) and B(lat2, long2), the following query (using the spherical law of cosines) finds the distance between the two points in kilometers:

SELECT 6371 * ACOS(COS(RADIANS(lat1)) * COS(RADIANS(lat2))
  * COS(RADIANS(long2) - RADIANS(long1))
  + SIN(RADIANS(lat1)) * SIN(RADIANS(lat2)));

Note: 6371 is the approximate radius of the earth in kilometers.

The result is the distance between the two points in kilometers, e.g., 40.45.

Q: How do I access TD Table metadata using Presto?

Solution: You may query our internal information_schema database.

-- List TD databases
SELECT * FROM information_schema.schemata

-- List TD tables
SELECT * FROM information_schema.tables

-- List all column metadata
SELECT * FROM information_schema.columns
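
For example, a sketch that lists the columns of a single table (the database and table names are hypothetical):

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'my_database' AND table_name = 'my_table'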

Optionally, you can use Presto SHOW … commands to see similar information:

-- List databases
SHOW SCHEMAS

-- List tables
SHOW TABLES FROM <database_name>

-- List column metadata for a specific table
SHOW COLUMNS FROM <database_name>.<table_name>
