This tutorial uses the Iris dataset provided by the UCI Machine Learning Repository.

The RandomForest interface changed in the v0.5.0 release of April 12, 2018.

Data Preparation

Upload Iris data to Treasure Data.

$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
$ sed '/^$/d' iris.data | awk 'BEGIN{OFS=","}{print NR,$0}' | sed '1i\
  rowid,sepal_length,sepal_width,petal_length,petal_width,class
  ' > iris.data.csv
$ head -3 iris.data.csv
rowid,sepal_length,sepal_width,petal_length,petal_width,class
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa

$ td db:create iris
$ td table:create iris original
$ td import:auto --format csv --column-header --time-value `date +%s` --auto-create iris.original iris.data.csv
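
To confirm the upload, count the rows; the Iris dataset contains 150 records (a quick optional check):

$ td query -w --type presto -d iris "
    select count(1) from original;
  "
> 150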

Then, create a mapping table that assigns a numeric label to each class.

$ td table:create iris label_mapping
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE label_mapping 
    select
      class,
      rank - 1 as label
    from (
    select
      distinct class,
      dense_rank() over (order by class) as rank
    from 
      original
    ) t;
"

`train_randomforest_classifier` requires the target `label` to start from 0.
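
You can verify the mapping with a quick query; the query above assigns Iris-setosa → 0, Iris-versicolor → 1, and Iris-virginica → 2 (a sketch, output omitted):

$ td query -w --type presto -d iris "
    select class, label from label_mapping order by label;
  "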

After that, prepare a table for RandomForest training.

$ td table:create iris training
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE training
    select
      rowid() as rowid,
      array(t1.sepal_length, t1.sepal_width, t1.petal_length, t1.petal_width) as features,
      t2.label
    from
      original t1
      JOIN label_mapping t2 ON (t1.class = t2.class);
  "


Train

Run training using a RandomForest classifier. The following example builds 50 decision trees for each mapper.

$ td table:create iris model
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE model
    select 
      train_randomforest_classifier(features, label, '-trees 50') 
    from
      training;
  "

There is no need to use `amplify` or `rand_amplify` with `train_randomforest_classifier`.


Training Options

You can get information about the training hyperparameters using the `-help` option as follows:

$ td query -w --type hive -d iris "
    select 
      train_randomforest_classifier(features, label, '-help')
    from
      training;
  "

usage: train_randomforest_classifier(array<double|string> features, int
       label [, const string options, const array<double> classWeights])-
       Returns a relation consists of <string model_id, double
       model_weight, string model, array<double> var_importance, int
       oob_errors, int oob_tests> [-attrs <arg>] [-depth <arg>] [-help]
       [-leafs <arg>] [-min_samples_leaf <arg>] [-rule <arg>] [-seed
       <arg>] [-splits <arg>] [-stratified] [-subsample <arg>] [-trees
       <arg>] [-vars <arg>]
 -attrs,--attribute_types <arg>      Comma separated attribute types (Q
                                     for quantitative variable and C for
                                     categorical variable. e.g.,
                                     [Q,C,Q,C])
 -depth,--max_depth <arg>            The maximum number of the tree depth
                                     [default: Integer.MAX_VALUE]
 -help                               Show function help
 -leafs,--max_leaf_nodes <arg>       The maximum number of leaf nodes
                                     [default: Integer.MAX_VALUE]
 -min_samples_leaf <arg>             The minimum number of samples in a
                                     leaf node [default: 1]
 -rule,--split_rule <arg>            Split algorithm [default: GINI,
                                     ENTROPY]
 -seed <arg>                         seed value in long [default: -1
                                     (random)]
 -splits,--min_split <arg>           A node that has greater than or
                                     equals to `min_split` examples will
                                     split [default: 2]
 -stratified,--stratified_sampling   Enable Stratified sampling for
                                     unbalanced data
 -subsample <arg>                    Sampling rate in range (0.0,1.0].
                                     [default: 1.0]
 -trees,--num_trees <arg>            The number of trees for each task
                                     [default: 50]
 -vars,--num_variables <arg>         The number of random selected
                                     features [default:
                                     ceil(sqrt(x[0].length))].
                                     int(num_variables * x[0].length) is
                                     considered if num_variable is (0,1]
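
For example, to make a run reproducible and declare all four Iris features as quantitative, you can combine the `-seed` and `-attrs` options documented above (a sketch; the seed value is arbitrary):

$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE model
    select
      train_randomforest_classifier(features, label, '-trees 50 -seed 31 -attrs Q,Q,Q,Q')
    from
      training;
  "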


Parallelize Training

In Treasure Data, a MapReduce task is launched for each 512 MB chunk of data. Each task is permitted to use only one virtual CPU core, so the training time of RandomForest is linear in the number of decision trees.

To parallelize RandomForest training across multiple tasks, you can use UNION ALL as follows:

$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE model
    select 
      train_randomforest_classifier(features, label, '-trees 25') 
    from
      training
    UNION ALL
    select 
      train_randomforest_classifier(features, label, '-trees 25') 
    from
      training;
  "

Alternatively, you can run multiple INSERT INTO queries, each adding a batch of trees to the model table, as follows:

$ td query -x --type hive -d iris "
    INSERT INTO TABLE model
    select
      train_randomforest_classifier(features, label, '-trees 25')
    from
      training
  "


Variable Importance and Out-of-Bag Test

The training output includes the variable importance and the out-of-bag (OOB) test results.

$ td query -w --type hive -d iris "
    select
      array_sum(var_importance) as var_importance,
      sum(oob_errors) / sum(oob_tests) as oob_err_rate
    from
      model;
  "

var_importance: [15.419672515790172,6.40339076572934,29.40103441471922,31.947085260871326]
oob_err_rate:   0.04666666666666667
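
The four importance values correspond, in order, to sepal_length, sepal_width, petal_length, and petal_width, so the petal measurements dominate in this result. A sketch that labels them by name, assuming Hive array indexing over the aggregated importances:

$ td query -w --type hive -d iris "
    WITH agg as (
      select array_sum(var_importance) as imp from model
    )
    select
      imp[0] as sepal_length,
      imp[1] as sepal_width,
      imp[2] as petal_length,
      imp[3] as petal_width
    from
      agg;
  "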


Predict

You can get prediction results using the trained model as follows:

$ td table:create iris predicted
$ td query -w -x --type hive -d iris "
    WITH t2 as (
        SELECT
          rowid,
          rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
          -- or: rf_ensemble(predicted.value, predicted.posteriori) as predicted -- ignores the OOB-accuracy-based model_weight
        FROM (
          SELECT
            t.rowid,
            p.model_weight,
            tree_predict(p.model_id, p.model, t.features, '-classification') as predicted
          FROM
            model p
            LEFT OUTER JOIN training t -- intentional cross join: applies every tree to every training row
        ) t1
        group by
          rowid
  )
  INSERT OVERWRITE TABLE predicted
  SELECT
      rowid,
      predicted.label, predicted.probability, predicted.probabilities
  FROM
      t2
  "

To use a model created by v0.4.2, use `tree_predict_v1` instead of `tree_predict` as follows:

tree_predict_v1(p.model_id, p.model_type, p.pred_model, t.features, true)


Evaluate

You can evaluate the accuracy of the training as follows:

$ td query -w --type presto -d iris "
    select count(1) from training;
  "
> 150

$ td query -w --type hive -d iris "
    WITH t1 as (
    SELECT
      t.rowid,
      t.label as actual,
      p.label as predicted
    FROM
      predicted p
      LEFT OUTER JOIN training t ON (t.rowid = p.rowid)
    )
    SELECT
      count(1) / 150.0
    FROM
      t1
    WHERE
      actual = predicted;
"

> 0.9933333333333333
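
Equivalently, you can compute the accuracy in a single query without hardcoding the row count (a sketch using Hive's if function):

$ td query -w --type hive -d iris "
    SELECT
      sum(if(t.label = p.label, 1, 0)) / count(1) as accuracy
    FROM
      predicted p
      JOIN training t ON (t.rowid = p.rowid);
  "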


Export Models in Human-Readable Format

You can export prediction models into JavaScript or Graphviz format.

$ td table:create iris model_exported
$ td query -w --type hive -d iris "
    INSERT OVERWRITE TABLE model_exported
    select
      model_id,
      tree_export(model, "-type javascript", array('sepal_length','sepal_width','petal_length','petak_width'), array('Setosa','Versicolour','Virginica')) as js,
      tree_export(model, "-type graphvis", array('sepal_length','sepal_width','petal_length','petak_width'), array('Setosa','Versicolour','Virginica')) as dot
    from
      model
 "

usage: tree_export(string model, const string options, optional
       array<string> featureNames=null, optional array<string>
       classNames=null) - exports a Decision Tree model as javascript/dot]
       [-help] [-output_name <arg>] [-r] [-t <arg>]
 -help                             Show function help
 -output_name,--outputName <arg>   output name [default: predicted]
 -r,--regression                   Is regression tree or not
 -t,--type <arg>                   Type of output [default: js,
                                   javascript/js, graphvis/dot

The Graphviz dot output can be visualized at viz-js.com.
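
For example, to fetch the dot representation of a single tree to paste into the visualizer (a sketch, output omitted):

$ td query -w --type hive -d iris "
    select dot from model_exported limit 1;
  "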

