Most queries from Hive 0.13 should work with minimal or no modification in Hive 2. However, Hive 2 is stricter than Hive 0.13.

Code Compatibility

Element | Hive 0.13 | Hive 2
SQL syntax | HiveQL (a MySQL-like dialect) | ANSI SQL syntax
Reserved words | | More reserved words than HiveQL. Quote reserved words with backticks.
Numeric data types | | Some values must be CAST explicitly.
Treasure Data specific syntax | | Obsolete. Remove syntax from queries.

Syntax Updates

The following Hive 0.13 constructs must be modified to be ready for Hive 2.

Stricter Syntax

Stricter Syntax: Additional Reserved Words and Function Names Require Escapes

Reserved words and UDF names used as column names must be escaped with backticks.

The most common case of this is `date`.

Example

date -> `date`

Unescaped reserved words manifest as parse errors in Hive 2. When you see a parse error, find the offending keyword in the script and surround it with backticks.

Example

Parse errors received from Hive 2.

Cannot recognize input near 'date' '>=' '1548403200' in expression specification
Cannot recognize input near 'timestamp' '=' '1548835200' in expression specification
Cannot recognize input near 'AS' 'date' ',' in selection target

To fix this, consider the following query, which fails to parse in Hive 2:

SELECT queue_id
, cs_id
, SUM(refunds) AS refunds_qtd
, SUM(x_refunds) AS x_refunds_qtd

FROM daily_cs_stats
WHERE date >= 1546329600 AND date < 1547884800

Rewrite the query as follows. Notice that date is now surrounded by backticks.

SELECT queue_id
, cs_id
, SUM(refunds) AS refunds_qtd
, SUM(x_refunds) AS x_refunds_qtd

FROM daily_cs_stats
WHERE `date` >= 1546329600 AND `date` < 1547884800
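
The third parse error above (near 'AS' 'date') comes from using date as a column alias; escaping the alias works the same way. A minimal sketch, reusing the daily_cs_stats table from the example and the TD_TIME_FORMAT UDF:

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd') AS `date`
  FROM daily_cs_stats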

For more information, review the TD Hive UDF documentation. You might also want to review the additional UDFs in the HiveMall UDF documentation.

Stricter Syntax: In SELECT, Use “AS” with Aliases

In testing some SELECT statements you might encounter a semantic error such as:

FAILED: SemanticException Line 0:-1 Invalid table alias or column reference 'sq_1': (possible column names are…

This can happen when Hive 2 is unable to unambiguously determine whether an identifier is a column alias or the name of a real column. For example, the following query produced an error.

SELECT time,
replace(replace(x_uuid,'%2B','+'),'/','/') x_uuid, category_id, ao_category_id
  FROM X 
 WHERE EXISTS(SELECT * FROM A WHERE A.mtime = X.time AND A.x_uuid = X.x_uuid) 
GROUP BY 1,2,3,4

The following error was thrown:

FAILED: SemanticException Line 0:-1 Invalid table alias or column reference 'sq_1': (possible column names are: _table_or_col a) x_uuid) sq_corr_1)), (. (tok_table_or_col sq_1) sq_corr_1))

The problem is that x_uuid is the name of a source column in the source table X, as well as the desired column alias in the SELECT.

The fix is to use an AS to make it unambiguous that the identifier is an alias.

SELECT time,
 replace(replace(x_uuid,'%2B','+'),'/','/') AS x_uuid, category_id, ao_category_id
  FROM X 
 WHERE EXISTS(SELECT * FROM A WHERE A.mtime = X.time AND A.x_uuid = X.x_uuid) 
GROUP BY 1,2,3,4

Alternatively, you can avoid using the name of an existing column as an alias.

SELECT time,
replace(replace(x_uuid,'%2B','+'),'/','/') x_uuid_out, category_id, ao_category_id
  FROM X 
 WHERE EXISTS(SELECT * FROM A WHERE A.mtime = X.time AND A.x_uuid = X.x_uuid)
GROUP BY 1,2,3,4

Stricter Syntax: Expression Not in GROUP BY Key

The GROUP BY clause in Hive 2 is stricter than in Hive 0.13.

If the column named in the GROUP BY clause does not match the column reference in the SELECT list, Hive 2 fails with the error Expression not in GROUP BY key 'column_name'.

For example, the following query fails because column1 in the GROUP BY clause does not match test_table.column1 in the SELECT:

SELECT test_table.column1,
       COUNT(*) AS cnt
 FROM (SELECT test_table.column1
         FROM test_table
      ) AS tab_a
GROUP BY column1
;

The reference to test_table in the outer SELECT is invalid to begin with (only the subquery alias tab_a is in scope there), so modify the query as follows:

SELECT tab_a.column1,
       COUNT(*) AS cnt
 FROM (SELECT test_table.column1
         FROM test_table
      ) AS tab_a
GROUP BY column1
;


Stricter Syntax: Ambiguous Column Reference

If a query selects the same column two or more times, Hive 2 checks the syntax more strictly than Hive 0.13 and the query fails with the error Ambiguous column reference <column_name> in <table_name>.

For example, the following query selects the column named id twice. Hive 0.13 accepts this, but Hive 2 reports the error.

SELECT …
  FROM (SELECT a.id, -- 1st 
               a.name,
               a.id, -- 2nd
        ...
          FROM table_a as a
        ...
        )


The second and subsequent references are unnecessary, so delete them as follows.

SELECT ...
  FROM (SELECT a.id, -- 1st 
               a.name,
        ...
          FROM table_a as a
        ...
       )


Stricter Syntax: Explicit Type Conversion Required

Hive 2 checks the data types of function arguments more strictly.

For example, the LENGTH function expects its argument to be of a string type.

In Hive 0.13, the following query works even if the argument is not a string:

SELECT LENGTH(col1) AS col1,
...

Hive 2 fails with the error LENGTH() only takes STRING/CHAR/VARCHAR/BINARY types as first argument, got LONG.

To fix this, CAST the argument from the unsupported data type to a supported one, such as string:

SELECT LENGTH(CAST(col1 AS string)) AS col1,
...


Stricter Syntax: No ORDER BY or LIMIT Inside a UNION

Hive 2 does not accept an ORDER BY clause or LIMIT n on the individual query statements combined by a UNION ALL operator.

For example, the following query will fail in the case of Hive 2.

SELECT ...
  FROM table_a
LIMIT 10
UNION ALL
SELECT ...
  FROM table_b
LIMIT 20
;


An ORDER BY on an individual UNION ALL branch never worked as you might expect in the first place, so simply remove it.
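
If you do need an ordered result, one option is to wrap the whole UNION ALL in a subquery and apply ORDER BY to the combined result (a sketch; col1 is a hypothetical sort column):

SELECT *
  FROM (SELECT ...
          FROM table_a
        UNION ALL
        SELECT ...
          FROM table_b) AS combined
ORDER BY col1
;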

In the case of LIMIT n, you can avoid the error by moving the limited query into a subquery or a WITH clause, as follows:

WITH dataset AS (
SELECT ...
  FROM table_a
LIMIT 10
)
SELECT ...
  FROM dataset
UNION ALL
SELECT ...
  FROM (SELECT ...
          FROM table_b
        LIMIT 20) as b
;


Stricter Types

Stricter Types: CAST Non-Integer Numeric Literals to DOUBLE Before Storage in TD

Numeric literals that are not integers default to DECIMAL in Hive 2. However, TD storage only supports DOUBLE, so you must CAST any such value to DOUBLE before storing it in a table.

51.5211 -> CAST(51.5211 AS DOUBLE)


Without the casting, you may see an error similar to this: 

FAILED: SemanticException java.lang.IllegalArgumentException: Failed to cast hive type : decimal(6,4) (field : col1dc) to TD type, only following hive types can be casted to TD type : tinyint, smallint, int, bigint, float, double, string, binary, array, map, struct, uniontype


Here is an example that works with Hive 0.13, but fails with Hive 2 if not type cast. 

WITH sourcecte as (
select 51.5211 as col1decimal
)
insert into table new_table_with_no_columns
select col1decimal from sourcecte


Assume that new_table_with_no_columns does not exist when this statement is run. TD assigns new_table_with_no_columns a schema based on the schema of sourcecte. The constant 51.5211 has type DECIMAL(6,4), so sourcecte.col1decimal is of type DECIMAL(6,4).

The INSERT tries to insert the DECIMAL(6,4) into the output table. In Hive 0.13, the value is automatically CAST as needed to DOUBLE and written. In Hive 2, this CAST is not automatic, so you get the error.
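
Applying the CAST to the example above gives a version that works in Hive 2 (a minimal sketch reusing the same sourcecte and new_table_with_no_columns names):

WITH sourcecte as (
select CAST(51.5211 AS DOUBLE) as col1decimal -- cast the literal so the stored column is typed DOUBLE
)
insert into table new_table_with_no_columns
select col1decimal from sourcecte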

This requirement only applies to literal numeric values; if you attempt to store an INTEGER column, return value from a function, etc. into a TD table, TD Hive casts the value appropriately.

Stricter Types: All Numeric Literals in an Array Must be CAST to the Same Data Type

All numeric literals in an array must be CAST to a consistent data type.

Here is an example that works with Hive 0.13 and gives an error with Hive 2.

SELECT
CAST(tuple[0] AS BIGINT) as offset,
tuple[1] as increase_weight,
tuple[2] as decrease_weight
FROM
(
SELECT
Array(
Array(0, 0.0, 1.0),
Array(1, 0.019675926, 0.9803323),
Array(2, 0.074074074, 0.925925926),
Array(3, 0.15625, 0.84375),
Array(4, 0.259259259, 0.740740741),
Array(5, 0.376157407, 0.623842593),
Array(6, 0.5, 0.5)
) as tuples
) t
LATERAL VIEW explode(tuples) exploded AS tuple


Can produce an error similar to this:

FAILED: SemanticException [Error 10016]: Line 10:0 Argument type mismatch '0.9022: Argument type "array<decimal(10,4)>" is different from preceding arguments. Previous type was "array<decimal(10,0)>"


To fix the error, CAST all values to decimal(19,9) (19 total digits of precision, with 9 digits after the decimal point).

SELECT
CAST(tuple[0] AS BIGINT) as offset,
tuple[1] as increase_weight,
tuple[2] as decrease_weight
FROM
(
SELECT
Array(
Array(0, cast(0.0 as decimal(19,9)), cast(1.0 as decimal(19,9))),
Array(1, cast(0.019675926 as decimal(19,9)),
cast(0.9803323 as decimal(19,9))),
Array(2, cast(0.074074074 as decimal(19,9)),
cast(0.925925926 as decimal(19,9))),
Array(3, cast(0.15625 as decimal(19,9)),
cast(0.84375 as decimal(19,9))),
Array(4, cast(0.259259259 as decimal(19,9)),
cast(0.740740741 as decimal(19,9))),
Array(5, cast(0.376157407 as decimal(19,9)),
cast(0.623842593 as decimal(19,9))),
Array(6, cast(0.5 as decimal(19,9)),
cast(0.5 as decimal(19,9))) )
as tuples
) t
LATERAL VIEW explode(tuples) exploded AS tuple



Miscellaneous

CROSS JOIN Requires a Query Hint

As a precaution, TD Hive 2 restricts the use of CROSS JOINs because they are often coding errors and are expensive to execute. If you intend to perform a CROSS JOIN, apply the following query hint to bypass this protection: 

- TD enable_cartesian_product: true


This is similar to setting hive.mapred.mode to nonstrict in a generic Hive environment.

Explicit CROSS JOIN example:

SELECT * FROM employee CROSS JOIN department

Implicit CROSS JOIN example (comma-separated tables with no join condition):

SELECT * FROM employee, department
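
If the cross join is intentional, the hint goes at the top of the query. A minimal sketch, assuming the hint text shown above is written as a leading SQL comment, using the same example tables:

-- TD enable_cartesian_product: true
SELECT * FROM employee CROSS JOIN department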


If you run a query with an explicit or implicit CROSS JOIN without adding the query hint, you’ll see an error like this:

FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features.

TD Hive 0.13 Obsolete Feature: v-column Syntax

v-column syntax is an obsolete TD-specific method of referring to columns in TD tables. It has been deprecated for years, but still occasionally appears in some old queries. It is not supported in Hive 2.

v-column syntax is still useful in certain TD debugging scenarios, such as examining ingested data before a schema has been assigned to it. In such scenarios, we recommend using Hive 0.13 during debugging.

The following changes are required for any query using v column[] syntax:

v column values are always returned as STRING type

In Hive 2, rather than use the v column syntax, reference the column, but explicitly CAST it to STRING, or whatever other type is needed. For example:

v['country_code'] --> cast(country_code as STRING)

cast(v['age'] as int) + 1 --> cast(age as int) + 1

- if the table already has an explicit schema with column 'age' as numeric, you may not need this cast.


If you use v column in Hive 2, you’ll see the following error message:

Invalid table alias or column reference 'v'

Columns selected through the v column whose names start with an underscore must be escaped with backquotes (`) when rewritten without the v column syntax.

For example:

v['_country_code'] -> CAST(`_country_code` AS STRING)

- note the use of CAST

For example, if the original query uses v[] syntax:

SELECT v['user_id'], v['a_type'], v['time_to_view'], v['action'],
time
FROM m_announcements
WHERE time > 1546156800


Rewrite this as:

SELECT cast(`user_id` as STRING), cast(`a_type` as BIGINT),
cast(`time_to_view` as DOUBLE), cast(`action` as BIGINT),
time
FROM m_announcements
WHERE time > 1546156800

 
Another example: if the current query is:

SELECT email, fx_id, pl_type FROM (
SELECT v['user_id'] as uid FROM apps WHERE v['action'] = 'x_ping' AND v['user_id'] IS NOT NULL AND time > 1548748800) r
JOIN (
SELECT v['_id'] as uid, email, fx_id, pl_type FROM user_tbl WHERE time > 1548748800 AND (v['is_bot'] = 'false' OR v['is_bot'] IS NULL)) u
ON r.uid = u.uid


Rewrite this as:

SELECT email, fx_id, pl_type FROM (
SELECT cast(`user_id` as STRING) as uid FROM apps WHERE cast(`action` as STRING) = 'x_ping' AND cast(`user_id` as STRING) IS NOT NULL AND time > 1548748800) r
JOIN (
SELECT cast(`_id` as STRING) as uid, email, fx_id, pl_type FROM user_tbl WHERE time > 1548748800 AND (cast(`is_bot` as STRING) = 'false' OR cast(`is_bot` as STRING) IS NULL)) u
ON r.uid = u.uid