Rating prediction by Matrix Factorization

Table of Contents

Data preparation

Download ml-20m.zip and unzip it. Then, create a database and import the raw ratings data into Treasure Data from the downloaded CSV. --time-value is used to add a dummy time column (This is because Treasure Data requires each row have a timestsamp).

$ td db:create movielens20m
$ td table:create movielens20m ratings
$ td import:auto --format csv --column-header --time-value `date +%s` --auto-create movielens20m.ratings ./ratings.csv

The first step is to split the original data for training and testing.

$ td table:create movielens20m ratings_mf

$ td query -w -x -d movielens20m "
INSERT OVERWRITE TABLE ratings_mf 
SELECT 
  rand(31) as rnd, 
  userid, 
  movieid, 
  rating 
FROM 
  ratings
"

The data is split 80% for training and 20% for testing.

$ td table:create movielens20m training_mf

$ td query -x --type hive -d movielens20m "
  INSERT OVERWRITE TABLE training_mf 
  SELECT userid, movieid, rating, rnd
  FROM ratings_mf 
  ORDER BY rnd DESC 
  LIMIT 16000000
"


$ td table:create movielens20m testing_mf

$ td query -x --type hive -d movielens20m "
  INSERT OVERWRITE TABLE testing_mf 
  SELECT userid, movieid, rating, rnd
  FROM ratings_mf 
  ORDER BY rnd ASC 
  LIMIT 4000263
"

Training

Calculate the mean rating in the training_mf dataset.

td query -w --type presto -d movielens20m "
  SELECT AVG(rating) FROM training_mf
"

> 3.52576146875

The above average value used in the following queries.

Run training of Matrix Factorization. Result output to TD (—result) feature is not recommended for issuing the following query. Use INSERT INTO table statement instead.

$ td table:create movielens20m mf_sgd_model_f20

$ td query -x --type hive -d movielens20m "
    INSERT OVERWRITE TABLE mf_sgd_model_f20 
    SELECT 
      idx, 
      array_avg(u_rank) as Pu, 
      array_avg(m_rank) as Qi, 
      avg(u_bias) as Bu, 
      avg(m_bias) as Bi 
    FROM ( 
      SELECT train_mf_sgd(userid, movieid, rating, '-factor 20 -mu 3.52576146875 -iter 50') AS (idx, u_rank, m_rank, u_bias, m_bias) 
      FROM training_mf 
    ) t 
    GROUP BY idx
"

The signature of train_mf_sgd is train_mf_sgd(int userid, int itemid, numeric rating [, string options]).

Training options

You can get information about hyperparameter for training using -help option as follows:

$ td query -w --type hive -d movielens20m "
    SELECT train_mf_sgd(userid, movieid, rating, '-help') AS (idx, u_rank, m_rank, u_bias, m_bias) 
    FROM training_mf 
"

  usage: MatrixFactorizationSGDUDTF [-cv_rate <arg>] [-disable_bias]
         [-disable_cv] [-eta <arg>] [-eta0 <arg>] [-help] [-iter <arg>] [-k
         <arg>] [-maxval <arg>] [-min_init_stddev <arg>] [-mu <arg>]
         [-power_t <arg>] [-r <arg>] [-rankinit <arg>] [-t <arg>]
         [-update_mean]
   -cv_rate,--convergence_rate <arg>   Threshold to determine convergence
                                       [default: 0.005]
   -disable_bias,--no_bias             Turn off bias clause
   -disable_cv,--disable_cvtest        Whether to disable convergence check
                                       [default: enabled]
   -eta <arg>                          The initial learning rate [default:
                                       0.001]
   -eta0 <arg>                         The initial learning rate [default
                                       0.2]
   -help                               Show function help
   -iter,--iterations <arg>            The number of iterations [default: 1]
   -k,--factor <arg>                   The number of latent factor [default:
                                       10]
   -maxval,--max_init_value <arg>      The maximum initial value in the rank
                                       matrix [default: 1.0]
   -min_init_stddev <arg>              The minimum standard deviation of
                                       initial rank matrix [default: 0.1]
   -mu,--mean_rating <arg>             The mean rating [default: 0.0]
   -power_t <arg>                      The exponent for inverse scaling
                                       learning rate [default 0.1]
   -r,--lambda <arg>                   The regularization factor [default:
                                       0.03]
   -rankinit <arg>                     Initialization strategy of rank
                                       matrix [random, gaussian] (default:
                                       random)
   -t,--total_steps <arg>              The total number of training examples
   -update_mean,--update_mu            Whether update (and return) the mean
                                       rating or not

Predict

$ td table:create movielens20m sgd_predict_f20

$ td query -x --type hive -d movielens20m "
  INSERT OVERWRITE TABLE sgd_predict_f20 
  SELECT 
      t2.actual, 
      mf_predict(t2.Pu, p2.Qi, t2.Bu, p2.Bi, 3.52560165625) as predicted 
  FROM ( 
      SELECT 
        t1.userid, 
        t1.movieid, 
        t1.rating as actual, 
        p1.Pu, 
        p1.Bu 
      FROM 
        testing_mf t1 LEFT OUTER JOIN mf_sgd_model_f20 p1 
        ON (t1.userid = p1.idx) 
  ) t2 
  LEFT OUTER JOIN mf_sgd_model_f20 p2 
  ON (t2.movieid = p2.idx)
"

Evaluate (computes MAE and RMSE)

$ td query -w --type hive -d movielens20m "
  SELECT 
    MAE(predicted, actual) AS mae, 
    RMSE(predicted, actual) AS rmse 
  FROM sgd_predict_f20
"
mae rmse
0.6123907679836259 0.8027164481776642

NOTE: If you are a legacy customer, you might be running an older version of Hive that does not support this feature. Please see this article about using the new version.


Last modified: Jun 21 2016 06:38:51 UTC

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels.