MovieLens 20m Rating prediction using Factorization Machine

Table of Contents

- Data preparation
- Training
- Training options
- Prediction
- Evaluation
- Fast Training using Feature Hashing

Download ml-20m.zip and unzip it. Then, create a database and import the raw ratings data into Treasure Data from the downloaded CSV. The --time-value option adds a dummy time column, because Treasure Data requires each row to have a timestamp.

$ td db:create movielens20m
$ td table:create movielens20m ratings
$ td import:auto --format csv --column-header --time-value `date +%s` --auto-create movielens20m.ratings ./ratings.csv

The first step is to prepare the data for splitting: convert each rating into a feature vector and attach a random value (rnd) that will later be used to divide the rows into training and test sets.

$ td table:create movielens20m ratings_fm

$ td query -w --type hive -d movielens20m "
    INSERT OVERWRITE TABLE ratings_fm 
    select
        rowid() as rowid,
        categorical_features(array('userid','movieid'), userid, movieid) 
          as features,
        rating,
        rand(31) as rnd
    from
        ratings;
  "
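A minimal sketch of what the categorical_features UDF in the query above produces, assuming Hivemall's "name#value" string convention for categorical features (each (column, value) pair becomes one distinct feature):

```python
# Sketch of Hivemall's categorical_features UDF: emit one "name#value"
# string per input column so that every (column, value) pair becomes a
# distinct categorical feature. Skipping NULL values is an assumption of
# this sketch, not a documented guarantee.
def categorical_features(names, *values):
    return [f"{n}#{v}" for n, v in zip(names, values) if v is not None]

# A rating row with userid=1 and movieid=296 becomes two features:
print(categorical_features(["userid", "movieid"], 1, 296))
# → ['userid#1', 'movieid#296']
```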

Now, split the data: 80% for training and 20% for testing.

$ td table:create movielens20m training_fm

$ td query -x --type hive -d movielens20m "
  INSERT OVERWRITE TABLE training_fm 
  SELECT rowid, features, rating, rnd 
  FROM ratings_fm 
  ORDER BY rnd DESC 
  LIMIT 16000000
"

Caution: If you get "java.lang.RuntimeException: INSERT into 'v' field is not supported", disable the v column of the table through the web console, or avoid using '*' in the query.

$ td table:create movielens20m testing_fm

$ td query -w --type hive -d movielens20m "
  INSERT OVERWRITE TABLE testing_fm 
  SELECT rowid, features, rating, rnd
  FROM ratings_fm 
  ORDER BY rnd ASC 
  LIMIT 4000263
"
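The split strategy used above can be sketched in plain Python, assuming a seeded random key per row: sorting by the rnd column and taking the top 80% for training (ORDER BY rnd DESC ... LIMIT) leaves exactly the bottom 20% for testing (ORDER BY rnd ASC ... LIMIT), so the two sets are disjoint and the split is reproducible.

```python
import random

# Each row carries a seeded pseudo-random key, like rand(31) in the query
# that built ratings_fm. Sorting on that key and slicing gives a
# deterministic, disjoint 80/20 split.
random.seed(31)
rows = [{"rowid": i, "rnd": random.random()} for i in range(10)]

ordered = sorted(rows, key=lambda r: r["rnd"], reverse=True)
training = ordered[:8]   # top 80%, like ORDER BY rnd DESC LIMIT 16000000
testing = ordered[8:]    # bottom 20%, like ORDER BY rnd ASC LIMIT 4000263

print(len(training), len(testing))
# → 8 2
```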

$ td table:create movielens20m testing_fm_exploded

$ td query -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE testing_fm_exploded 
select 
  rowid,
  extract_feature(fv) as feature,
  extract_weight(fv) as Xi,
  rating
from
  testing_fm t1 LATERAL VIEW explode(add_bias(features)) t2 as fv
"

Caution: Don't forget to call add_bias in the above query. There is no need to call add_bias when preparing the training data, because train_fm always adds a bias term internally.
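The exploding helpers can be sketched as follows, assuming Hivemall's conventions: features are "feature:weight" strings, categorical features omit the weight (defaulting to 1.0), and add_bias appends a constant bias feature (named "0" here, an assumption of this sketch):

```python
# Sketch of the UDFs used in the explode query above.
def add_bias(features, bias_feature="0"):
    # Append a constant bias feature with weight 1.0.
    return features + [f"{bias_feature}:1.0"]

def extract_feature(fv):
    # "height:173.2" -> "height"; categorical "userid#1" has no ':' and
    # is returned unchanged.
    return fv.rsplit(":", 1)[0] if ":" in fv else fv

def extract_weight(fv):
    # "height:173.2" -> 173.2; categorical features default to 1.0.
    return float(fv.rsplit(":", 1)[1]) if ":" in fv else 1.0

fvs = add_bias(["userid#1", "movieid#296"])
print([(extract_feature(fv), extract_weight(fv)) for fv in fvs])
# → [('userid#1', 1.0), ('movieid#296', 1.0), ('0', 1.0)]
```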

Training

$ td table:create movielens20m fm_model

$ td query -w -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE fm_model 
select
  feature,
  avg(Wi) as Wi,
  array_avg(Vif) as Vif
from (
  select 
    train_fm(features, rating, '-factor 10 -iters 50 -min 1 -max 5') 
        as (feature, Wi, Vif)
  from 
    training_fm
) t
group by feature;
"
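The GROUP BY in the training query merges per-task models: each mapper emits its own (Wi, Vif) estimate for a feature, and averaging combines them into one model row per feature. A minimal sketch, with array_avg assumed to be an element-wise average over factor vectors:

```python
# Sketch of the model-averaging aggregation: avg(Wi) averages the linear
# weights, and array_avg(Vif) averages the latent factor vectors
# element-wise across the partial models.
def array_avg(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Two partial models that both learned weights for the same feature:
partial = [(0.25, [0.5, 1.0]), (0.75, [1.5, 2.0])]
wi = sum(w for w, _ in partial) / len(partial)
vif = array_avg([v for _, v in partial])
print(wi, vif)
# → 0.5 [1.0, 1.5]
```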

Training options

You can get information about the hyperparameters for training by using the -help option as follows:

$ td query -w --type hive -d movielens20m "
  select 
    train_fm(features, rating, '-help') 
        as (feature, Wi, Vif)
  from 
    training_fm
"

usage: train_fm(array<string> x, double y [, const string options]) -
     Returns a prediction value [-adareg] [-c] [-cv_rate <arg>]
     [-disable_cv] [-eta <arg>] [-eta0 <arg>] [-f <arg>] [-help]
     [-init_v <arg>] [-int_feature] [-iters <arg>] [-lambda <arg>] [-max
     <arg>] [-maxval <arg>] [-min <arg>] [-min_init_stddev <arg>] [-p
     <arg>] [-power_t <arg>] [-seed <arg>] [-sigma <arg>] [-t <arg>]
     [-va_ratio <arg>] [-va_threshold <arg>]
-adareg,--adaptive_regularizaion             Whether to enable adaptive
                                            regularization [default:
                                            OFF]
-c,--classification                          Act as classification
-cv_rate,--convergence_rate <arg>            Threshold to determine
                                            convergence [default: 0.005]
-disable_cv,--disable_cvtest                 Whether to disable
                                            convergence check [default:
                                            OFF]
-eta <arg>                                   The initial learning rate
-eta0 <arg>                                  The initial learning rate
                                            [default 0.1]
-f,--factor <arg>                            The number of the latent
                                            variables [default: 10]
-help                                        Show function help
-init_v <arg>                                Initialization strategy of
                                            matrix V [random, gaussian]
                                            (default: random)
-int_feature,--feature_as_integer            Parse a feature as integer
                                            [default: OFF, ON if -p
                                            option is specified]
-iters,--iterations <arg>                    The number of iterations
                                            [default: 1]
-lambda,--lambda0 <arg>                      The initial lambda value for
                                            regularization [default:
                                            0.01]
-max,--max_target <arg>                      The maximum value of target
                                            variable
-maxval,--max_init_value <arg>               The maximum initial value in
                                            the matrix V [default: 1.0]
-min,--min_target <arg>                      The minimum value of target
                                            variable
-min_init_stddev <arg>                       The minimum standard
                                            deviation of initial matrix
                                            V [default: 0.1]
-p,--size_x <arg>                            The size of x
-power_t <arg>                               The exponent for inverse
                                            scaling learning rate
                                            [default 0.1]
-seed <arg>                                  Seed value [default: -1
                                            (random)]
-sigma <arg>                                 The standard deviation for
                                            initializing V [default:
                                            0.1]
-t,--total_steps <arg>                       The total number of training
                                            examples
-va_ratio,--validation_ratio <arg>           Ratio of training data used
                                            for validation [default:
                                            0.05f]
-va_threshold,--validation_threshold <arg>   Threshold to start
                                            validation. At least N
                                            training examples are used
                                            before validation [default:
                                            1000]

Prediction

$ td table:create movielens20m fm_predict

$ td query -w -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE fm_predict 
select
  t1.rowid,
  fm_predict(p1.Wi, p1.Vif, t1.Xi) as predicted
from 
  testing_fm_exploded t1
  LEFT OUTER JOIN fm_model p1 ON (t1.feature = p1.feature)
group by
  t1.rowid
"
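A minimal sketch of the computation fm_predict aggregates per rowid, assuming the standard factorization machine model equation (the per-row (Wi, Vif, Xi) tuples come from the join above):

```python
# Standard FM prediction:
#   y = sum_i(Wi * xi)
#       + 1/2 * sum_f[ (sum_i Vif * xi)^2 - sum_i (Vif * xi)^2 ]
# Rows whose feature missed the LEFT OUTER JOIN (NULL model weights) are
# assumed to contribute nothing.
def fm_predict(rows):
    """rows: list of (Wi, Vif, Xi) tuples for one test instance."""
    rows = [r for r in rows if r[0] is not None]
    linear = sum(w * x for w, _, x in rows)
    num_factors = len(rows[0][1])
    pairwise = 0.0
    for f in range(num_factors):
        s = sum(v[f] * x for _, v, x in rows)
        s2 = sum((v[f] * x) ** 2 for _, v, x in rows)
        pairwise += s * s - s2
    return linear + 0.5 * pairwise

# Two features of one test instance, each with 2 latent factors:
rows = [(0.5, [0.1, 0.2], 1.0), (0.3, [0.4, 0.1], 1.0)]
print(fm_predict(rows))
```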

Evaluation

$ td query -w --type hive -d movielens20m "
  select
    mae(p.predicted, rating) as mae,
    rmse(p.predicted, rating) as rmse
  from
    testing_fm as t
    JOIN fm_predict as p on (t.rowid = p.rowid);
  "
mae                 rmse
0.6056794023148888  0.7964665312607863
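The two metrics reported above can be sketched in plain Python, assuming Hivemall's mae and rmse UDAFs compute the usual mean absolute error and root mean squared error over (predicted, actual) pairs:

```python
import math

def mae(pred, actual):
    # Mean absolute error: average of |predicted - actual|.
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def rmse(pred, actual):
    # Root mean squared error: sqrt of the average squared error.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

pred = [3.5, 4.2, 2.8]
actual = [4.0, 4.0, 3.0]
print(mae(pred, actual), rmse(pred, actual))
```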

Fast Training using Feature Hashing

Training of Factorization Machines (FM) can be made more efficient, in terms of both speed and memory consumption, by using INT features. This section shows how to run FM training with int features, specifically by using feature hashing.

Training using Feature Hashing

Caution: Hivemall uses a dense array internally when both -int_feature and -num_features are specified; otherwise it uses a sparse map. A dense array is more CPU-efficient than a sparse map. The default number of features created by feature_hashing is 2^24 = 16777216, so -num_features 16777216 is the value to use with feature_hashing.
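A minimal sketch of the feature-hashing idea: each feature string is mapped by a hash function into a fixed integer range (2^24 by default), so the model can index a dense array by hashed id instead of keeping a string-keyed map. The use of crc32 here is an assumption for illustration; Hivemall's feature_hashing uses its own hash function.

```python
import zlib

NUM_FEATURES = 2 ** 24  # matches -num_features 16777216

def feature_hashing(feature):
    # Map an arbitrary feature string into [0, NUM_FEATURES). crc32 is a
    # stand-in hash; the exact function and range in Hivemall may differ.
    return zlib.crc32(feature.encode()) % NUM_FEATURES

hashed = [feature_hashing(f) for f in ["userid#1", "movieid#296"]]
assert all(0 <= h < NUM_FEATURES for h in hashed)
```

Because the hash is deterministic, the same feature string always maps to the same id, which is why the test data must be hashed with the same function before joining against the model.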

$ td query -w -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE fm_model
select
  feature,
  avg(Wi) as Wi,
  array_avg(Vif) as Vif
from (
  select 
    train_fm(feature_hashing(features), rating, 
      '-factor ${factor} -iters ${iters} -eta 0.01 -int_feature -num_features 16777216'
    ) as (feature, Wi, Vif)
  from 
    training_fm
) t
group by feature;
"

Prediction and Evaluation using Feature Hashing

Caution: Do NOT forget to apply feature_hashing to the test data.

$ td query -w --type hive -d movielens20m "
WITH predicted as (
  select
    t1.rowid,
    fm_predict(p1.Wi, p1.Vif, t1.Xi) as predicted
  from 
    testing_fm_exploded t1
    LEFT OUTER JOIN fm_model p1 ON (feature_hashing(t1.feature) = p1.feature)        
  group by
    t1.rowid
)
select
  mae(p.predicted, rating) as mae,
  rmse(p.predicted, rating) as rmse
from
  testing_fm as t
  JOIN predicted as p on (t.rowid = p.rowid);
"

Last modified: Jul 01 2016 13:59:28 UTC
