Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Presto uses JSON text to materialize query results. If the preceding table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result. So, even if the query computation is almost finished, outputting JSON results takes a long time. 

Use CREATE TABLE (table) AS SELECT …

...

Treasure Data uses a column-oriented storage format, so accessing a small set of columns is really fast, but as the number of columns increases in your query, it deteriorates the query performance. Be selective in choosing columns:.

true
Code Block
linenumberstrue
SELECT id, name, address FROM my_table
Excerpt
hidden

Additionally, avoid using the following SELECT statements as they can significantly slow down your query:

  • SELECT + ORDER BY
  • SELECT + upper case
  • SELECT + duplicated column names

Instead, use result_output_direct.

Excerpt

Use result_output_redirect

By setting result_output_redirect='true', you can enhance the performance of a Presto job.

Code Block
linenumberstrue
-- set session result_output_redirect='true'


Note

In the following situations, the query hint dos not work:

  • When a query includes ORDER BY

  • When a query is CREATE TABLE or INSERT TABLE

How Does it Work?

td-worker processes data in a single thread as follows:

  • downloading query results

  • transforming JSON response from the presto coordinator into msgpack.gz format

  • uploading them to S3

Queries that Produce a Large Volume of Query Results

If the query result has 10GB or more data, td-presto is waiting for query result retrieval by td-worker even though the query result is ready. td-worker is a single thread Ruby process and is not designed to read such large query results; currently, td-worker interleaves various data processing in a single thread, such as downloading query results, transforming JSON responses from the presto coordinator into msgpack.gz format, and uploading them to S3. It cannot fully utilize machine resources, therefore, Treasure Data recommends that you redirect the query result to S3 to make increase the performance. 

...