Hive Performance Tuning
Table of Contents
- Basic knowledge of Treasure Data.
- Basic knowledge of the Hive query language.
Leveraging Time-based Partitioning
All imported data is automatically partitioned into hourly buckets, based on the ‘time’ field within each data record. By specifying the time range to query, you avoid reading unnecessary data and can thus speed up your query significantly.
1) WHERE time <=> Integer
When the ‘time’ field within the WHERE clause is specified, the query parser will automatically detect which partition(s) should be processed. Please note that this auto detection will not work if you specify the time with
float instead of
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 [GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 + 3600 [GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 - 3600 [BAD]: SELECT field1, field2, field3 FROM tbl WHERE time > 13493930200 / 10 [BAD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020.00 [BAD]: SELECT field1, field2, field3 FROM tbl WHERE time BETWEEN 1349392000 AND 1349394000
An easier way to slice the data is to use TD_TIME_RANGE UDF.
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, "2013-01-01 PDT") [GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, "2013-01-01", NULL, "PDT") [GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, "2013-01-01", TD_TIME_ADD("2013-01-01", "1day", "PDT"))
However, if you use TD_TIME_FORMAT UDF or division in TD_TIME_RANGE, time partition opimization doesn’t work. For instance, the following conditions disable optimization.
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_TIME_FORMAT(TD_SCHEDULED_TIME(), 'yyyy-MM-dd')) [BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_TIME_FORMAT(1356998401, 'yyyy-MM-dd')) [BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_SCHEDULED_TIME() / 86400 * 86400)) [BAD]: SELECT ... WHERE TD_TIME_RANGE(time, 1356998401 / 86400 * 86400))
Set Custom Schema
$ td schema:set testdb www_access action:string user:int $ td query -w -d testdb "SELECT user, COUNT(1) AS cnt FROM www_access WHERE action='login' GROUP BY user ORDER BY cnt DESC"
After setting the schema, queries issued with named columns instead of ‘v’ will use the schema information to achieve a more optimized execution path. In particular, GROUP BY performance will improve significantly.
DISTRIBUTE BY…SORT BY v. ORDER BY
In Hive, ORDER BY is not a very fast operation because it forces all the data to go into the same reducer node. By doing this, Hive ensures that the entire dataset is totally ordered.
However, sometimes we do not require total ordering. For example, suppose you have a table called
user_action_table where each row has
time. Your goal is to order them by time per user_id.
If you are doing this with ORDER BY, you would run
SELECT time, user_id, action FROM user_action_table ORDER BY user_id, time
However, you can achieve the same result with
SELECT time, user_id, action FROM user_action_table DISTRIBUTE BY user_id SORT BY time
This is because all the rows belonging to the same user_id go to the same reducer (“DISTRIBUTE BY user_id”) and in each reducer, rows are sorted by time (“SORT BY time”). This is faster than the other query because it uses multiple reducers as opposed to a single reducer.
You can learn more about the differences between ORDER BY and SORT BY here.
Avoid “SELECT count(DISTINCT field) FROM tbl”
This query looks familier to SQL users, but this query is very slow because only one reducer is used to process the request.
SELECT count(DISTINCT field) FROM tbl
So please rewrite the query like below to leverage multiple reducers.
SELECT count(1) FROM ( SELECT DISTINCT field FROM tbl ) t
Considering the Cardinality within GROUP BY
There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
good: SELECT GROUP BY uid, gender bad: SELECT GROUP BY gender, uid
Efficient Top-k Query Processing using each_top_k
Efficient processing of Top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data.
Our Hive extension
each_top_k helps running Top-k processing efficiently.
- Suppose the following table as the input
- Then, list top-2 students for each class
The standard way using SQL window function would be as follows:
SELECT student, class, score, rank FROM ( SELECT student, class, score, rank() over (PARTITION BY class ORDER BY score DESC) as rank FROM table ) t WHRE rank <= 2
An alternative and efficient way to compute top-k items using
each_top_k is as follows:
SELECT each_top_k( 2, class, score, class, student -- output columns other in addition to rank and score ) as (rank, score, class, student) FROM ( SELECT * FROM table CLUSTER BY class -- Mandatory for `each_top_k` ) t
|`CLUSTER BY x` is a synonym of `DISTRIBUTE BY x CLASS SORT BY x` and required when using `each_top_k`.|
|`each_top_k` is benefical where the number of grouping keys are large. If the number of grouping keys are not so large (e.g., less than 100), consider using `rank() over` instead.|
The function signature of
each_top_k is follows:
each_top_k(int k, ANY group, double value, arg1, arg2, ..., argN) returns a relation (int rank, double value, arg1, arg2, .., argN).
Any number types or timestamp are accepted for the type of
value but it MUST be not NULL.
Do null hanlding like
if(value is null, -1, value) to avoid null.
k is less than 0, reverse order is used and tail-K records are returned for each
The ranking semantics of
each_top_k follows SQL’s
dense_rank and then limits results by
Please refer Hivemall userguide for further information.
Last modified: Nov 16 2016 07:59:19 UTC