Supported Hive UDFs (User Defined Functions)

This article lists all the UDFs supported on Treasure Data. Treasure Data supports three types of UDFs: (1) native UDFs supported by Hive and (2) Treasure Data specific functions whose names start with TD_, and (3) Hivemall specific functions.

Table of Contents

Hive native UDFs

All native UDFs on Hive can also be used on Treasure Data. The complete list of UDFs can be found on the HiveQL Language Manual’s UDF Page.

Treasure Data specific UDFs

TD_TIME_RANGE

Signature

boolean TD_TIME_RANGE(int/long unix_timestamp,
                      int/long/string start_time,
                      int/long/string end_time
                      [, string default_timezone = 'UTC'])

Description

Untitled-3
We strongly recommend that you take advantage of time-based partitioning. Please refer to the Performance Tuning article for more information.

This UDF returns true if unix_timestamp is equal to or later than start_time and older than end_time (start_time <= time && time < end_time). If end_time is omitted or NULL, the UDF assumes it’s infinite. If start_time is NULL, the UDF assumes it’s 0.

start_time and end_time could be a string which represents a time (e.g. “2012-01-01 00:00:00 +0900”) or a UNIX timestamp (e.g. 1325343600). If the format of start_time or end_time strings is invalid, the UDF returns NULL.

default_timezone is used to interpret the timezone of start_time or end_time. If start_time or end_time themselves specify a timezone (e.g. “2012-01-01 +0700”), then default_timezone is ignored. If default_timezone is not specified while start_time or end_time also does not specify a timezone, then the UDF uses ‘UTC’ as the timezone for start_time or end_time. A list of supported time zones can be found here.

Example

This example selects records with timestamps “2013-01-01 00:00:00 PDT” or later.

SELECT ... WHERE TD_TIME_RANGE(time, "2013-01-01 PDT")
SELECT ... WHERE TD_TIME_RANGE(time, "2013-01-01", NULL, "PDT")
Untitled-3
Notice that the time of day ("00:00:00") can be omitted, as shown above. Alternately, the time of day can be specified up to seconds. In general, the time string should be formatted as either "YYYY-MM-DD" or "YYYY-MM-DD hh:mm:ss", e.g., "2013-01-01" or "1999-01-01 07:00:00".

TD_SCHEDULED_TIME

Signature

long TD_SCHEDULED_TIME()

Description

This UDF returns the exact time when the job was scheduled by the scheduled query feature. The returned value may differ from NOW() because the actual query start time may be delayed.

If the query is not a scheduled query, the UDF returns the time when the job was issued. You may use this UDF with TD_TIME_ADD for incremental aggregation.

TD_TIME_ADD

Signature

long TD_TIME_ADD(int/long/string time,
                 string duration
                 [, string default_timezone = 'UTC'])

Description

This UDF returns a timestamp equal to time offset by duration. The UDF supports the following formats for the duration:

  • “Nd”: after N days (e.g. “1d”, “2d”, “30d”)
  • “-Nd”: before N days (e.g. “-1d”, “-2d”, “-30d”)
  • “Nh”: after N hours (e.g. “1h”, “2h”, “48h”)
  • “-Nh”: before N hours (e.g. “-1h”, “-2h”, “-48h”)
  • “Nm”: after N minutes (e.g. “1m”, “2m”, “90m”)
  • “-Nm”: before N minutes (e.g. “-1m”, “-2m”, “-90m”)
  • “Ns”: after N seconds (e.g. “1s”, “2s”, “90s”)
  • “-Ns”: before N seconds (e.g. “-1s”, “-2s”, “-90s”)

The formats above can be combined. For example, “1h30m” means “after 1 hour and 30 minutes”.

default_timezone is used to interpret time. If time itself has timezone (e.g. “2012-01-01 +0700”), then default_timezone is ignored. If default_timezone is not specified while time also does not specify a timezone, then the UDF uses ‘UTC’ as the timezone for time. A list of supported time zones can be found here.

If the formats of the time or duration strings are invalid, the UDF returns NULL.

Untitled-3
"year" and "month" durations are NOT supported, because these have complicated implications. A month can be 28, 29, 30, or 31 days, and a year could be 365 or 366 days. To implement these, this function will become a lot heavier and impact performance.

Example

This example selects records with timestamps “2013-01-01 00:00:00 UTC” or later but older than “2013-01-02 00:00:00 UTC”.

SELECT ... WHERE TD_TIME_RANGE(time,
                               "2013-01-01",
                               TD_TIME_ADD("2013-01-01", "1d"))

A typical use of this function within scheduled queries is in conjunction with TD_TIME_RANGE and TD_SCHEDULED_TIME to narrow the time range to determined set of days, hours, minutes, or seconds. For example:

SELECT ... WHERE TD_TIME_RANGE(time,
                               TD_TIME_ADD(TD_SCHEDULED_TIME(), "-1d"),
                               TD_SCHEDULED_TIME())

TD_TIME_FORMAT

Signature

string TD_TIME_FORMAT(long unix_timestamp,
                      string format
                      [, string timezone = 'UTC'])

Description

This UDF converts a UNIX timestamp to a string with the specified format (see the Supported time formats in TD_TIME_FORMAT UDF page for available formats). For example, “yyyy-MM-dd HH:mm:ss z” converts 1325376000 to “2012-01-01 00:00:00 UTC”. If no timezone is specified, the UDF uses UTC.

Example

This example formats a UNIX timestamp into a date formatted string

SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z') ... FROM ...
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z', 'PST') ... FROM ...
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH:mm:ss z', 'JST') ... FROM ...

TD_TIME_PARSE

Signature

long TD_TIME_PARSE(string time
                   [, string default_timezone = 'UTC'])

Description

This UDF converts a time string into a UNIX timestamp.

default_timezone is used to interpret time. If time itself has timezone (e.g. “2012-01-01 +0700”), then default_timezone is ignored. If default_timezone is not specified while time also does not specify a timezone, then the UDF uses ‘UTC’ as the timezone for time. A list of supported time zones can be found here.

If the format of the time string is invalid, the UDF returns NULL.

TD_DATE_TRUNC

Signature

long TD_DATE_TRUNC(string unit,
                   long time
                   [, string default_timezone = 'UTC'])

Description

This UDF performs a timestamp truncation at the level specified by the ‘unit’ parameter. The supported units are:

  • ‘minute’
  • ‘hour’
  • ‘day’
  • ‘week’
  • ‘month’
  • ‘quarter’
  • ‘year’

An optional ‘timezone’ parameter can be specified to indicate an alternative reference timezone the ‘unit’ is referenced to. While the input ‘time’ is in global Unix time format, in different timezones ‘day’ that is the start of a day corresponds to different times.

This function mimics the functionality of native Presto’s date_trunc function, the main difference being that Presto’s date_trunc does not allow specification of the timezone, since it uses the sessions' reference timezone.

Example

SELECT TD_DATE_TRUNC('day', time) FROM tbl

with time equal 1416787667 corresponding to ‘2014-11-24 00:07:47 UTC’ will return 1416787200 corresponding to ‘2014-11-24 00:00:00 UTC’.

With the same value and timezone ‘PST’ instead:

SELECT TD_DATE_TRUNC('day', time, 'PST') FROM tbl

the function returns 1416758400 since the start of day for the ‘PST’ timezone is 8 hours behind the start of day for ‘UTC’.

TD_LAST

Signature

TD_LAST(ret_col, cmp_col1, cmp_col2, ...)

Description

This aggregate UDF finds the row with the largest value in the ‘comparison column’ (cmp_col1) and returns that row’s ‘retrieve column’ (ret_col) value. Additional comparison columns (cmp_col2, cmp_col3, etc.) are used as tiebreakers when the preceding comparison column has more than one row with the largest value.

Example

This example selects the URL of the most recent access log (the row with the largest time) for each user.

SELECT user, TD_LAST(url, time) AS last_url FROM access_logs GROUP BY user

TD_FIRST

Signature

TD_FIRST(ret_col, cmp_col1, cmp_col2, ...)

Description

This aggregate UDF finds the row with the smallest value in the ‘comparison column’ (cmp_col1) and returns that row’s ‘retrieve column’ (ret_col) value. Additional comparison columns (cmp_col2, cmp_col3, etc.) are used as tiebreakers when the preceding comparison column has more than one row with the smallest value.

Example

This example selects the referer URL of the earliest access log (the row with the smallest time) for each page_id.

SELECT page_id, TD_FIRST(referer, time) AS first_referer FROM access_logs GROUP BY page_id

TD_X_RANK

Signature

long TD_X_RANK(keys)

Description

Returns the rank of each row within the partition of the result set. The rank of a row is one plus the number of ranks that come before the row.

Example

You need to CLUSTER BY or ORDER BY within a sub query to use this feature. CLUSTER BY is more scalable than ORDER BY because it doesn’t require total order across multiple nodes, thus allowing us to process the query in parallel.

SELECT TD_X_RANK(c), c, u FROM
  (SELECT country AS c, user_id AS u
   FROM users CLUSTER BY c) t

SELECT TD_X_RANK(c, lc1), c, lc1, u FROM
  (select country AS c, location1 AS lc1, user_id AS u
   FROM users CLUSTER BY c, lc1) t

TD_SESSIONIZE

Signature

string TD_SESSIONIZE(int/long unix_timestamp, int timeout, string sessionize_by)

Description

This UDF takes three arguments:

  1. The time field specified in unix epoch
  2. A timeout window in seconds (when this amount of time elapses, it indicates the start of a new session)
  3. The field name to sessionize by

It returns a UUID for the session of the request.

Example

Similiar to TD_X_RANK, you need to CLUSTER BY or DISTRIBUTE BY the sessionize_by field, and SORT BY the unix_timestamp field within a sub query to use this feature. The query below sessionizes based on user_id and the timestamp. You may want to use user_id or cookie_id instead of ip_address for non-anonymous logs.

SELECT TD_SESSIONIZE(time, 3600, ip_address) as session_id, time, ip_address, path
FROM (
  SELECT time, ip_address, path
  from web_logs
  distribute by ip_address
  sort by ip_address,time
) t

TD_PARSE_USER_AGENT

Signature

string TD_PARSE_USER_AGENT(user_agent string [, options string])

Description

Untitled-3
TD_PARSE_USER_AGENT is now deprecated, will not be updated from now on and will be disabled in future. We recommend to use TD_PARSE_AGENT.

This UDF returns the result of parsing a user agent string. The user agent is parsed by the basis of rules. Accepts the following options as a string, as user options:

os, os_family, os_major, os_minor, ua, ua_family, ua_major, ua_minor, device

os and ua return json. With _family, _major and _minor return a string. The device option also returns a string.

Example

The example shows the result of parsing user agent from access log.

SELECT TD_PARSE_USER_AGENT(agent) AS agent FROM www_access
> {"user_agent": {"family": "IE", "major": "9", "minor": "0", "patch": ""}, "os": {"family": "Windows 7", "major": "", "minor": "", "patch": "", "patch_minor": ""}, "device": {"family": "Other"}}
SELECT TD_PARSE_USER_AGENT(agent, 'os') AS agent_os FROM www_access
> {"family": "Windows 7", "major": "", "minor": "", "patch": "", "patch_minor": ""}
SELECT TD_PARSE_USER_AGENT(agent, 'os_family') AS agent_os_family FROM www_access
> Windows 7

TD_PARSE_AGENT

Signature

MAP(string,string) TD_PARSE_AGENT(user_agent string)

This UDF returns a Map value of result to parse a user agent string. The UDF is implemented by Woothee.

Example

The example shows the result of parsing user agent from access log.

SELECT TD_PARSE_AGENT(agent) AS parsed_agent, agent FROM www_access
> {"os":"Windows 7","vendor":"Google","os_version":"NT 6.1","name":"Chrome","category":"pc","version":"16.0.912.77"},
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7
SELECT TD_PARSE_AGENT(agent)['os'] AS os FROM www_access
> Windows 7 => os from user-agent, or carrier name of mobile phones
SELECT TD_PARSE_AGENT(agent)['vendor'] AS vendor FROM www_access
> Google // => name of vendor
SELECT TD_PARSE_AGENT(agent)['os_version'] AS os_version FROM www_access
> NT 6.1 // => "NT 6.3" (for Windows), "10.8.3" (for OSX), "8.0.1" (for iOS), ....
SELECT TD_PARSE_AGENT(agent)['name'] AS name FROM www_access
> Chrome // => name of browser (or string like name of user-agent)
SELECT TD_PARSE_AGENT(agent)['category'] AS category FROM www_access
> pc // => "pc", "smartphone", "mobilephone", "appliance", "crawler", "misc", "unknown"
SELECT TD_PARSE_AGENT(agent)['version'] AS version FROM www_access
> 16.0.912.77 => version of browser, or terminal type name of mobile phones

TD_MD5

Signature

string TD_MD5(col)

Description

This UDF calculates the MD5 hash digest from a given string.

Example

SELECT TD_MD5(column) FROM tbl

TD_URL_DECODE

Signature

string TD_URL_DECODE(col)

Description

This UDF applies URL decoding for a given string.

Example

SELECT TD_URL_DECODE(column) FROM tbl

TD_LAT_LONG_TO_COUNTRY

Signature

string TD_LAT_LONG_TO_COUNTRY(string type, double latitude, double longitude)

Description

This UDF converts geo location information (latitude/longitude) to country name.

Example

SELECT
  TD_LAT_LONG_TO_COUNTRY('FULL_NAME',                 37, -122)
  TD_LAT_LONG_TO_COUNTRY('THREE_LETTER_ABBREVIATION', 37, -122)
  TD_LAT_LONG_TO_COUNTRY('POSTAL_ABBREVIATION',       37, -122)
  TD_LAT_LONG_TO_COUNTRY('SORTABLE_NAME',             37, -122)

TD_SUBSTRING_INENCODING

Signature

string TD_SUBSTRING_INENCODING(string str, int max_len_inbytes, string charset)

Description

This UDF returns the substring or slice of the byte array of str from the 0-index position at most max_len_inbytes with charset encoding. charset can be selected from java.nio.charset.Charset.

Example

SELECT TD_SUBSTRING_INENCODING(column, 10, 'UTF-8') FROM tbl

TD_DIVIDE

Signature

double TD_DIVIDE(double numerator, double denominator)

Description

This UDF returns the division of numeric types safely even if denominator is zero. If denominator is zero, it returns 0 and in other case it returns the numerator divided by denominator.

Example

SELECT TD_DIVIDE(nume, denom) FROM tbl;

TD_SUMIF

Signature

double TD_SUMIF(double column, boolean predicate)

Description

This UDF returns the sum of the column which satisfies the predidate.

Example

SELECT TD_SUMIF(amount, amount > 0) FROM tbl;

TD_AVGIF

Signature

double TD_AVGIF(double column, boolean predicate)

Description

This UDF returns the average of column which satisfies the predicate.

Example

SELECT TD_AVGIF(age, age > 20) FROM tbl;

Hivemall generic UDFs

Treasure Data bundles Hivemall, the scalable machine learning library for Hive. Hivemall generic UDFs also be used on Treasure Data. The list of UDFs can be found on the Hivemall’s generic functions page.


Last modified: Oct 17 2016 04:36:26 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.