Iris multiclass classification by RandomForest

This tutorial uses the Iris dataset from the UCI Machine Learning Repository.

Data preparation

Upload the Iris data to Treasure Data.

$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
$ sed '/^$/d' iris.data | awk 'BEGIN{OFS=","}{print NR,$0}' | sed '1i\
  rowid,sepal_length,sepal_width,petal_length,petal_width,class
  ' > iris.data.csv
$ head -3 iris.data.csv
rowid,sepal_length,sepal_width,petal_length,petal_width,class
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa

$ td db:create iris
$ td table:create iris original
$ td import:auto --format csv --column-header --time-value `date +%s` --auto-create iris.original iris.data.csv
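
The sed/awk pipeline above can be mirrored in Python for readers less familiar with shell tools (an illustrative sketch, not part of the Treasure Data workflow): drop blank lines, prepend a 1-based rowid, and add a header row.

```python
# Illustrative Python equivalent of the sed/awk pipeline above.
def prepare_iris_csv(lines):
    header = "rowid,sepal_length,sepal_width,petal_length,petal_width,class"
    rows = [line.strip() for line in lines if line.strip()]  # sed '/^$/d'
    return [header] + [f"{i},{row}" for i, row in enumerate(rows, start=1)]

raw = [
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "4.9,3.0,1.4,0.2,Iris-setosa",
    "",  # the raw iris.data file ends with blank lines
]
csv_rows = prepare_iris_csv(raw)
print(csv_rows[1])  # 1,5.1,3.5,1.4,0.2,Iris-setosa
```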

Then, create a mapping table that assigns a numeric label to each class.

$ td table:create iris label_mapping
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE label_mapping 
    select
      class,
      rank - 1 as label
    from (
    select
      distinct class,
      dense_rank() over (order by class) as rank
    from 
      original
    ) t;
"
Note: `train_randomforest_classifier` requires the target `label` to start from 0.
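
The label-mapping query above can be sketched in Python (an illustrative sketch, not part of the Treasure Data workflow): take the distinct classes, rank them, and subtract 1 to get a 0-based label.

```python
# Illustrative sketch of the label-mapping query: assign a 0-based label
# to each class by its alphabetical (dense) rank.
def build_label_mapping(classes):
    distinct = sorted(set(classes))                      # DISTINCT + ORDER BY class
    return {c: rank for rank, c in enumerate(distinct)}  # dense_rank() - 1

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica",
           "Iris-setosa", "Iris-virginica"]
mapping = build_label_mapping(classes)
print(mapping)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
```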

After that, prepare a training table for RandomForest training.

$ td table:create iris training
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE training
    select
      rowid() as rowid,
      array(t1.sepal_length, t1.sepal_width, t1.petal_length, t1.petal_width) as features,
      t2.label
    from
      original t1
      JOIN label_mapping t2 ON (t1.class = t2.class);
  "

Train

Run training using the RandomForest classifier. The following query builds 50 decision trees.

$ td table:create iris model
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE model
    select 
      train_randomforest_classifier(features, label, '-trees 50') 
        -- as (model_id, model_type, pred_model, var_importance, oob_errors, oob_tests)
    from
      training;
  "
Note: There is no need to use `amplify` or `rand_amplify` with `train_randomforest_classifier`.

Training options

You can get information about the training hyperparameters using the -help option as follows:

$ td query -w --type hive -d iris "
    select 
      train_randomforest_classifier(features, label, '-help')
    from
      training;
  "

  usage: train_randomforest_classifier(double[] features, int label [,
         string options]) - Returns a relation consists of <int model_id,
         int model_type, string pred_model, array<double> var_importance,
         int oob_errors, int oob_tests> [-attrs <arg>] [-depth <arg>]
         [-disable_compression] [-help] [-leafs <arg>] [-min_samples_leaf
         <arg>] [-output <arg>] [-rule <arg>] [-seed <arg>] [-splits <arg>]
         [-trees <arg>] [-vars <arg>]
   -attrs,--attribute_types <arg>   Comma separated attribute types (Q for
                                    quantitative variable and C for
                                    categorical variable. e.g., [Q,C,Q,C])
   -depth,--max_depth <arg>         The maximum number of the tree depth
                                    [default: Integer.MAX_VALUE]
   -disable_compression             Whether to disable compression of the
                                    output script [default: false]
   -help                            Show function help
   -leafs,--max_leaf_nodes <arg>    The maximum number of leaf nodes
                                    [default: Integer.MAX_VALUE]
   -min_samples_leaf <arg>          The minimum number of samples in a leaf
                                    node [default: 1]
   -output,--output_type <arg>      The output type (serialization/ser or
                                    opscode/vm or javascript/js) [default:
                                    serialization]
   -rule,--split_rule <arg>         Split algorithm [default: GINI, ENTROPY]
   -seed <arg>                      seed value in long [default: -1
                                    (random)]
   -splits,--min_split <arg>        A node that has greater than or equals
                                    to `min_split` examples will split
                                    [default: 2]
   -trees,--num_trees <arg>         The number of trees for each task
                                    [default: 50]
   -vars,--num_variables <arg>      The number of random selected features
                                    [default: ceil(sqrt(x[0].length))].
                                    int(num_variables * x[0].length) is
                                    considered if num_variable is (0,1]

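For intuition, two of the options above can be sketched in Python (an illustrative sketch; the actual Hivemall implementation may differ): the default GINI split rule and the default for -vars/--num_variables.

```python
import math
from collections import Counter

# Gini impurity, the default split criterion (-rule GINI): low impurity
# means the labels at a node are mostly of one class.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Default for -vars/--num_variables: ceil(sqrt(number of features)).
def default_num_variables(num_features):
    return math.ceil(math.sqrt(num_features))

print(gini([0, 0, 0, 0]))        # 0.0 (pure node)
print(gini([0, 1]))              # 0.5
print(default_num_variables(4))  # 2 for the 4 Iris features
```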
Parallelize Training

In Treasure Data, a MapReduce task is launched for each 512 MB chunk of input data. Each task is permitted to use only one virtual CPU core, so the training time of RandomForest is linear in the number of decision trees.

To parallelize RandomForest training, you can use UNION ALL as follows:

$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE model
    select 
      train_randomforest_classifier(features, label, '-trees 25') 
    from
      training
    UNION ALL
    select 
      train_randomforest_classifier(features, label, '-trees 25') 
    from
      training;
  "

In the above query, two tasks each build 25 decision trees in parallel, yielding 50 trees in total.
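
The UNION ALL trick works because a forest is just the union of independently trained trees. A toy Python sketch of the idea (train_forest is a made-up stand-in for train_randomforest_classifier):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for train_randomforest_classifier: returns one "model row"
# per tree, tagged with the task's seed.
def train_forest(num_trees, seed):
    return [{"tree": i, "seed": seed} for i in range(num_trees)]

# Two tasks train 25 trees each in parallel, like the two UNION ALL branches.
with ThreadPoolExecutor(max_workers=2) as pool:
    halves = pool.map(train_forest, [25, 25], [100, 200])

model = [row for half in halves for row in half]  # UNION ALL
print(len(model))  # 50
```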

Variable importance and Out-of-Bag test

The training output includes the variable importance and out-of-bag (OOB) test results.

$ td query -w --type hive -d iris "
    select
      array_sum(var_importance) as var_importance,
      sum(oob_errors) / sum(oob_tests) as oob_err_rate
    from
      model;
  "
var_importance oob_err_rate
[15.419672515790172,6.40339076572934,29.40103441471922,31.947085260871326] 0.04666666666666667
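
Each model row (one per tree) carries per-tree counts, so the query sums before dividing. A sketch of the aggregation with made-up numbers:

```python
# Illustrative sketch of the aggregation above; the numbers are made up.
model_rows = [
    {"var_importance": [0.30, 0.12, 0.58, 0.64], "oob_errors": 2, "oob_tests": 50},
    {"var_importance": [0.28, 0.14, 0.60, 0.62], "oob_errors": 1, "oob_tests": 50},
]

# array_sum(var_importance): element-wise sum across rows.
var_importance = [sum(vals) for vals in
                  zip(*(row["var_importance"] for row in model_rows))]

# sum(oob_errors) / sum(oob_tests): forest-wide OOB error rate.
oob_err_rate = (sum(r["oob_errors"] for r in model_rows)
                / sum(r["oob_tests"] for r in model_rows))

print(oob_err_rate)  # 0.03
```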

Build a prediction model in Javascript

We can get the prediction model in a human-readable Javascript format as follows:

$ td table:create iris model_javascript
$ td query -x --type hive -d iris "
    INSERT OVERWRITE TABLE model_javascript
    select 
      train_randomforest_classifier(features, label, '-trees 50 -output_type js -disable_compression') 
        as (model_id, model_type, pred_model, var_importance, oob_errors, oob_tests)
    from
      training;
  "

$ td query -w --type presto -d iris "
    select pred_model from model_javascript limit 1
  "

if(x[3] <= 0.8) {
  0;
} else  {
  if(x[2] <= 4.85) {
    if(x[3] <= 1.65) {
      1;
    } else  {
      if(x[0] <= 5.4) {
        2;
      } else  {
        1;
      }
    }
  } else  {
    if(x[1] <= 2.35) {
      2;
    } else  {
      if(x[3] <= 1.75) {
        if(x[2] <= 5.449999999999999) {
          1;
        } else  {
          2;
        }
      } else  {
        2;
      }
    }
  }
}
Note: Javascript evaluation is NOT supported in Treasure Data, to avoid untrusted code execution.
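
The exported tree above can be transcribed into Python for inspection (an illustrative transcription; x is the feature array [sepal_length, sepal_width, petal_length, petal_width] and the return value is the 0-based label):

```python
# Direct transcription of the exported decision tree above.
def predict(x):
    if x[3] <= 0.8:
        return 0                              # Iris-setosa
    if x[2] <= 4.85:
        if x[3] <= 1.65:
            return 1                          # Iris-versicolor
        return 2 if x[0] <= 5.4 else 1
    if x[1] <= 2.35:
        return 2                              # Iris-virginica
    if x[3] <= 1.75:
        return 1 if x[2] <= 5.449999999999999 else 2
    return 2

print(predict([5.1, 3.5, 1.4, 0.2]))  # 0
print(predict([6.5, 3.0, 5.5, 2.0]))  # 2
```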

Predict

We can get a prediction result using the prediction model as follows:

$ td table:create iris predicted_vm
$ td query -w -x --type hive -d iris "
    WITH t1 as (
        SELECT
          rowid,
          rf_ensemble(predicted) as predicted
        FROM (
          SELECT
            t.rowid, 
            tree_predict(p.model_id, p.model_type, p.pred_model, t.features, true) as predicted
          FROM
            model p
            LEFT OUTER JOIN training t
        ) t1
        group by
          rowid
  )
  INSERT OVERWRITE TABLE predicted_vm
  SELECT
      rowid,
      predicted.label, predicted.probability, predicted.probabilities
  FROM
      t1;
  "
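
In the query above, `rf_ensemble` combines the per-tree votes for each row into a final label. The idea can be sketched as majority voting (an illustrative sketch, not Hivemall's exact implementation):

```python
from collections import Counter

# Majority vote over per-tree predictions for one row: pick the most
# voted label and report the vote distribution as probabilities.
def rf_ensemble(per_tree_labels, num_classes=3):
    votes = Counter(per_tree_labels)
    label, count = votes.most_common(1)[0]
    total = len(per_tree_labels)
    probabilities = [votes.get(c, 0) / total for c in range(num_classes)]
    return {"label": label, "probability": count / total,
            "probabilities": probabilities}

result = rf_ensemble([2, 2, 1, 2, 2])          # votes from 5 trees
print(result["label"], result["probability"])  # 2 0.8
```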

Evaluate

You can evaluate the prediction accuracy as follows:

$ td query -w --type presto -d iris "
    select count(1) from training;
  "
> 150

$ td query -w --type hive -d iris "
    WITH t1 as (
    SELECT
      t.rowid,
      t.label as actual,
      p.label as predicted
    FROM
      predicted_vm p
      LEFT OUTER JOIN training t ON (t.rowid = p.rowid)
    )
    SELECT
      count(1) / 150.0
    FROM
      t1
    WHERE
      actual = predicted;
"

> 1.0
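
The evaluation query simply computes the fraction of rows whose predicted label matches the actual label. A minimal sketch with made-up labels:

```python
# Accuracy: fraction of rows where predicted == actual.
def accuracy(actual, predicted):
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
print(accuracy(actual, predicted))  # 5 of 6 correct
```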

Last modified: Jan 12 2016 04:59:00 UTC
