Treasure ML (Machine Learning)

Treasure Machine Learning is based on Hivemall, a scalable machine learning library that runs on Apache Hive. Hivemall is designed to scale with both the number of training instances and the number of training features.

Supported Algorithms

Hivemall provides machine learning and feature engineering functions as Hive UDFs, UDAFs, and UDTFs.

Classification

  • Perceptron
  • Passive Aggressive (PA, PA1, PA2)
  • Confidence Weighted (CW)
  • Adaptive Regularization of Weight Vectors (AROW)
  • Soft Confidence Weighted (SCW1, SCW2)
  • AdaGradRDA (with hinge loss)
  • RandomForest
  • Factorization Machine

Regression

  • Logistic Regression using Stochastic Gradient Descent
  • AdaGrad / AdaDelta (with logistic loss)
  • Passive Aggressive Regression (PA1, PA2)
  • AROW regression
  • RandomForest
  • Factorization Machine

Recommendation

k-Nearest Neighbor

  • Minhash (LSH with Jaccard index)
  • b-Bit minhash
  • Brute-force search using cosine similarity

Feature Engineering

Hivemall Generic UDFs

HIVEMALL_VERSION

The HIVEMALL_VERSION function shows the current Hivemall version.

SELECT HIVEMALL_VERSION()

ARRAY_CONCAT

Signature

array array_concat(array<ANY> x1, array<ANY> x2, ..)

Description

ARRAY_CONCAT returns the concatenation of the given arrays.

Example

select array_concat(array(1),array(2,3))
> [1,2,3]

ARRAY_INTERSECT

Signature

array_intersect(array<ANY> x1, array<ANY> x2, ..)

Description

ARRAY_INTERSECT returns the intersection of the given arrays.

Example

select array_intersect(array(1,3,4),array(2,3,4),array(3,5))
> [3]

ARRAY_REMOVE

Signature

array_remove(array<int|text> original, int|text|array<int> target)

Description

ARRAY_REMOVE returns a copy of the original array with the target element(s) removed.

Example

select array_remove(array(1,null,3),array(1));
> [null,3]

select array_remove(array("aaa","bbb"),"bbb");
> ["aaa"]

SORT_AND_UNIQ_ARRAY

Signature

sort_and_uniq_array(array<int>)

Description

SORT_AND_UNIQ_ARRAY takes an array of ints and returns it sorted in natural order with duplicate elements removed.

Example

select sort_and_uniq_array(array(3,1,1,-2,10));
> [-2,1,3,10]

SUBARRAY_ENDWITH

Signature

subarray_endwith(array<int|text> original, int|text key)

Description

SUBARRAY_ENDWITH returns the leading portion of the original array, up to and including the specified key.

Example

select subarray_endwith(array(1,2,3,4), 3);
> [1,2,3]

SUBARRAY_STARTWITH

Signature

subarray_startwith(array<int|text> original, int|text key)

Description

SUBARRAY_STARTWITH returns the trailing portion of the original array, starting at the specified key.

Example

select subarray_startwith(array(1,2,3,4), 2);
> [2,3,4]

SUBARRAY

Signature

subarray(array<int> original, int fromIndex, int toIndex)

Description

SUBARRAY returns a slice of the original array between the inclusive fromIndex and the exclusive toIndex. Indexes are zero-based, as the example shows.

Example

select subarray(array(1,2,3,4,5,6), 2,4)
> [3,4]

ARRAY_AVG

Signature

array_avg(array<NUMBER>)

Description

ARRAY_AVG returns an array in which each element is the mean of the values at the same position across the input arrays.

ARRAY_SUM

Signature

array_sum(array<NUMBER>)

Description

ARRAY_SUM returns an array in which each element is the sum of the values at the same position across the input arrays.
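
As a rough illustration, the following Python sketch models ARRAY_SUM and ARRAY_AVG as position-by-position aggregates over several input arrays. It assumes that all arrays have the same length and that the functions combine values across rows; the helper names are hypothetical and this is not Hivemall's implementation.

```python
from typing import List, Sequence


def array_sum(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum over rows (assumes every row has the same length)."""
    return [sum(column) for column in zip(*rows)]


def array_avg(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over rows (assumes at least one row)."""
    n = len(rows)
    return [total / n for total in array_sum(rows)]


rows = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(array_sum(rows))  # [4.0, 6.0, 8.0]
print(array_avg(rows))  # [2.0, 3.0, 4.0]
```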

TO_BITS

Signature

to_bits(int[] indexes)

Description

TO_BITS returns a bitset representation of the given indexes as a long[].

Example

select to_bits(array(1,2,3,128));
>[14,-9223372036854775808]

UNBITS

Signature

unbits(long[] bitset)

Description

UNBITS returns a long array of the indexes that are set in the given bitset representation.

Example

select unbits(to_bits(array(1,4,2,3)));
> [1,2,3,4]

BITS_OR

Signature

bits_or(array<long> b1, array<long> b2, ..)

Description

BITS_OR returns the bitwise OR of the given bitsets.

Example

select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3))));
> [1,2,3,4]
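
The bitset functions above can be modeled in a few lines of Python. This sketch assumes 64-bit words stored lowest word first and reported as signed integers, similar to a Java long[]; that layout and the helper names are assumptions for illustration, not Hivemall's code.

```python
from typing import List

WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def _to_signed(word: int) -> int:
    """Reinterpret an unsigned 64-bit word as a signed 64-bit integer."""
    return word - (1 << WORD_BITS) if word >= (1 << (WORD_BITS - 1)) else word


def to_bits(indexes: List[int]) -> List[int]:
    """Pack non-negative indexes into 64-bit words, lowest word first."""
    if not indexes:
        return []
    words = [0] * (max(indexes) // WORD_BITS + 1)
    for i in indexes:
        words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return [_to_signed(w) for w in words]


def unbits(bitset: List[int]) -> List[int]:
    """List the indexes of all set bits in ascending order."""
    out = []
    for word_index, word in enumerate(bitset):
        unsigned = word & WORD_MASK
        for bit in range(WORD_BITS):
            if (unsigned >> bit) & 1:
                out.append(word_index * WORD_BITS + bit)
    return out


def bits_or(*bitsets: List[int]) -> List[int]:
    """Bitwise OR of one or more bitsets; shorter ones are treated as zero-padded."""
    result = [0] * max(len(b) for b in bitsets)
    for b in bitsets:
        for i, word in enumerate(b):
            result[i] |= word
    return result


print(unbits(bits_or(to_bits([1, 4]), to_bits([2, 3]))))  # [1, 2, 3, 4]
```

Under these assumptions, to_bits([1, 4]) is [18] and to_bits([2, 3]) is [12], so their OR is [30], which unpacks to the indexes 1 through 4.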

BITS_COLLECT

Signature

bits_collect(int|long x)

Description

BITS_COLLECT returns a bitset as an array.

DEFLATE

Signature

deflate(TEXT data [, const int compressionLevel])

Description

DEFLATE returns a compressed BINARY object by using Deflater. The compression level must be in the range [-1,9].

Example

select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
> AA+=kaIM|WTt!+wbGAA

INFLATE

Signature

inflate(BINARY compressedData)

Description

INFLATE returns a decompressed STRING by using Inflater

Example

select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
> aaaaaaaaaaaaaaaabbbbccc
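
The DEFLATE and INFLATE round trip can be sketched in Python, with the standard zlib module standing in for Deflater and Inflater. The UTF-8 encoding and the helper names are assumptions made for illustration; the compressed bytes are not guaranteed to match what Hivemall produces.

```python
import zlib


def deflate(text: str, level: int = -1) -> bytes:
    """Compress text; level must be in [-1, 9], where -1 selects the default."""
    if not -1 <= level <= 9:
        raise ValueError("compression level must be in range [-1, 9]")
    return zlib.compress(text.encode("utf-8"), level)


def inflate(data: bytes) -> str:
    """Decompress bytes produced by deflate() back into text."""
    return zlib.decompress(data).decode("utf-8")


original = "aaaaaaaaaaaaaaaabbbbccc"
print(inflate(deflate(original)) == original)  # True
```

Because the compressed result is binary, it is typically wrapped in a text encoding such as BASE91 before being stored in a string column, as the SQL examples do.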

MAP_GET_SUM

Signature

map_get_sum(map<int,float> src, array<int> keys)

Description

MAP_GET_SUM returns the sum of the values retrieved from src by the given keys.
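
A minimal Python sketch of this lookup-and-sum behavior follows. How keys that are missing from the map are treated is not stated here, so the sketch assumes they contribute nothing; the sample weights are made up.

```python
from typing import Dict, Iterable


def map_get_sum(src: Dict[int, float], keys: Iterable[int]) -> float:
    """Sum the values found under the given keys.

    Assumption: keys that are absent from src are skipped.
    """
    return sum(src[k] for k in keys if k in src)


weights = {1: 0.5, 2: 1.5, 3: -1.0}
print(map_get_sum(weights, [1, 2]))  # 2.0
```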

MAP_TAIL_N

Signature

map_tail_n(map SRC, int N)

Description

MAP_TAIL_N returns the last N entries of SRC in sorted order.
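
The idea can be sketched in Python as follows. The sketch assumes that the entries are sorted by key before the tail is taken, which this page does not confirm; the helper name and sample data are hypothetical.

```python
from typing import Any, Dict


def map_tail_n(src: Dict[Any, Any], n: int) -> Dict[Any, Any]:
    """Keep the last n entries after sorting by key (assumed sort order)."""
    if n <= 0:
        return {}
    items = sorted(src.items(), key=lambda kv: kv[0])
    return dict(items[-n:])


print(map_tail_n({3: "c", 1: "a", 2: "b"}, 2))  # {2: 'b', 3: 'c'}
```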

TO_MAP

Signature

to_map(key, value)

Description

TO_MAP converts two aggregated columns into a key-value map.

TO_ORDERED_MAP

Signature

to_ordered_map(key, value [, const boolean reverseOrder=false])

Description

TO_ORDERED_MAP converts two aggregated columns into an ordered key-value map
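
To illustrate the difference between TO_MAP and TO_ORDERED_MAP, the Python sketch below builds both kinds of map from a list of (key, value) rows. It assumes that a later duplicate key overwrites an earlier one and that ordering is by key; both points and the helper names are assumptions rather than documented behavior.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Tuple[Any, Any]  # (key, value)


def to_map(rows: Iterable[Row]) -> Dict[Any, Any]:
    """Collect (key, value) rows into a map; later duplicates win (assumed)."""
    return dict(rows)


def to_ordered_map(rows: Iterable[Row], reverse_order: bool = False) -> Dict[Any, Any]:
    """Collect rows into a map whose entries are ordered by key."""
    items = sorted(to_map(rows).items(), key=lambda kv: kv[0], reverse=reverse_order)
    return dict(items)  # Python dicts keep insertion order


rows = [("b", 2), ("a", 1), ("c", 3)]
print(to_ordered_map(rows))                      # {'a': 1, 'b': 2, 'c': 3}
print(to_ordered_map(rows, reverse_order=True))  # {'c': 3, 'b': 2, 'a': 1}
```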

ROWID

Signature

rowid()

Description

ROWID returns a generated row ID of the form {TASK_ID}-{SEQUENCE_NUMBER}.

SIGMOID

Signature

sigmoid(x)

Description

SIGMOID returns 1.0 / (1.0 + exp(-x))
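
For reference, the same formula can be written in Python as shown below. The branch on the sign of x is a common way to avoid overflow in exp() for large negative inputs; it is a choice made for this sketch and says nothing about how Hivemall evaluates the expression.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)), arranged to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) is at most 1
    return z / (1.0 + z)


print(sigmoid(0.0))  # 0.5
```

The function maps any real number into the interval [0, 1] and satisfies sigmoid(x) + sigmoid(-x) = 1.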

BASE91

Signature

base91(binary)

Description

BASE91 converts the argument from binary to a BASE91 string

Example

select base91(deflate('aaaaaaaaaaaaaaaabbbbccc'));
> AA+=kaIM|WTt!+wbGAA

UNBASE91

Signature

unbase91(string)

Description

UNBASE91 converts a BASE91 string to binary.

Example

select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc'))));
> aaaaaaaaaaaaaaaabbbbccc

NORMALIZE_UNICODE

Signature

normalize_unicode(string str [, string form])

Description

NORMALIZE_UNICODE transforms str with the specified normalization form. The form takes one of NFC (default), NFD, NFKC, or NFKD

Example

select normalize_unicode('ﾊﾝｶｸｶﾅ','NFKC');
> ハンカクカナ
select normalize_unicode('㈱㌧㌦Ⅲ','NFKC');
> (株)トンドルIII

SPLIT_WORDS

Signature

split_words(string query [, string regex])

Description

SPLIT_WORDS returns an array containing the split strings.

IS_STOPWORD

Signature

is_stopword(string word)

Description

IS_STOPWORD returns whether the given word is an English stopword.

TOKENIZE

Signature

tokenize(string englishText [, boolean toLowerCase])

Description

TOKENIZE returns the words of the given English text as an array.

TOKENIZE_JA

Signature

tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags])

Description

TOKENIZE_JA returns the tokenized strings as an array.

Example

select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
>["kuromoji","使う","分かち書き","テスト","","","引数","normal","search","extended","指定","デフォルト","normal"," モード"]

CONVERT_LABEL

Signature

convert_label(const int|const float)

Description

CONVERT_LABEL converts from -1|1 to 0.0f|1.0f, or from 0.0f|1.0f to -1|1
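
The two-way mapping can be sketched in Python as follows. The sketch assumes that the argument's type selects the direction (an int is treated as a -1|1 label, a float as a 0.0|1.0 label) and that any other value is rejected; both are assumptions for illustration.

```python
from typing import Union


def convert_label(label: Union[int, float]) -> Union[int, float]:
    """Map int -1|1 to float 0.0|1.0, or float 0.0|1.0 to int -1|1."""
    if isinstance(label, bool):
        raise TypeError("label must be an int or a float")
    if isinstance(label, int):
        if label == -1:
            return 0.0
        if label == 1:
            return 1.0
    elif isinstance(label, float):
        if label == 0.0:
            return -1
        if label == 1.0:
            return 1
    raise ValueError(f"unsupported label: {label!r}")


print(convert_label(-1))   # 0.0
print(convert_label(0.0))  # -1
```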

EACH_TOP_K

Signature

each_top_k(int K, Object group, double cmpKey, *)

Description

EACH_TOP_K returns the top-K values for each group (or the tail-K values when K is less than 0).
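
A per-group top-K selection can be sketched in Python like this. The input row layout (group, cmp_key, payload), the output layout (rank, cmp_key, group, payload), and the helper name are all hypothetical; the sketch only illustrates ranking within each group and the negative-K convention described above.

```python
from collections import defaultdict
from typing import Any, List, Tuple

Row = Tuple[Any, float, Any]  # (group, cmp_key, payload)


def each_top_k(k: int, rows: List[Row]) -> List[Tuple[int, float, Any, Any]]:
    """Return (rank, cmp_key, group, payload) for the top-k rows per group.

    A negative k selects the |k| rows with the smallest cmp_key instead.
    """
    if k == 0:
        return []
    groups = defaultdict(list)
    for group, cmp_key, payload in rows:
        groups[group].append((cmp_key, payload))
    result = []
    for group, members in groups.items():
        ordered = sorted(members, key=lambda m: m[0], reverse=(k > 0))
        for rank, (cmp_key, payload) in enumerate(ordered[:abs(k)], start=1):
            result.append((rank, cmp_key, group, payload))
    return result


rows = [("a", 0.9, "x1"), ("a", 0.1, "x2"), ("a", 0.5, "x3"), ("b", 0.7, "y1")]
print(each_top_k(2, rows))
# [(1, 0.9, 'a', 'x1'), (2, 0.5, 'a', 'x3'), (1, 0.7, 'b', 'y1')]
```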

GENERATE_SERIES

Signature

generate_series(const int|bigint start, const int|bigint end)

Description

GENERATE_SERIES generates a series of values from start to end, similar to PostgreSQL’s generate_series.

Example

WITH dual as (
  select 1
)
select generate_series(1,9)
from dual;
>
1
2
3
4
5
6
7
8
9

X_RANK

Signature

x_rank(KEY)

Description

X_RANK generates a pseudo sequence number starting from 1 for each key.
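
The numbering can be sketched in Python as a counter that restarts whenever the key changes. The sketch assumes that rows arrive already grouped by key, so equal keys are adjacent; the helper name and sample keys are hypothetical.

```python
from typing import Hashable, Iterable, Iterator


def x_rank(keys: Iterable[Hashable]) -> Iterator[int]:
    """Yield 1, 2, 3, ... for adjacent rows sharing a key; restart on a new key."""
    last = object()  # sentinel that never equals a real key
    count = 0
    for key in keys:
        count = count + 1 if key == last else 1
        last = key
        yield count


print(list(x_rank(["a", "a", "b", "b", "b", "c"])))  # [1, 2, 1, 2, 3, 1]
```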

Further Reading

Please refer to the Hivemall wiki for further reading.


Last modified: Jan 12 2017 00:05:31 UTC