# Rating prediction by Matrix Factorization

## Data preparation

Download ml-20m.zip and unzip it. Then, create a database and import the raw ratings data into Treasure Data from the downloaded CSV. `--time-value` is used to add a dummy time column (This is because Treasure Data requires each row have a timestsamp).

```\$ td db:create movielens20m
\$ td table:create movielens20m ratings
\$ td import:auto --format csv --column-header --time-value `date +%s` --auto-create movielens20m.ratings ./ratings.csv
```

The first step is to split the original data for training and testing.

```\$ td table:create movielens20m ratings_mf

\$ td query -w -x -d movielens20m "
INSERT OVERWRITE TABLE ratings_mf
SELECT
rand(31) as rnd,
userid,
movieid,
rating
FROM
ratings
"
```

The data is split 80% for training and 20% for testing.

```\$ td table:create movielens20m training_mf

\$ td query -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE training_mf
SELECT userid, movieid, rating, rnd
FROM ratings_mf
ORDER BY rnd DESC
LIMIT 16000000
"

\$ td table:create movielens20m testing_mf

\$ td query -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE testing_mf
SELECT userid, movieid, rating, rnd
FROM ratings_mf
ORDER BY rnd ASC
LIMIT 4000263
"
```

## Training

Calculate the mean rating in the training_mf dataset.

```td query -w --type presto -d movielens20m "
SELECT AVG(rating) FROM training_mf
"

> 3.52576146875
```

The above average value used in the following queries.

Run training of Matrix Factorization. Result output to TD (â€”result) feature is not recommended for issuing the following query. Use `INSERT INTO table` statement instead.

```\$ td table:create movielens20m mf_sgd_model_f20

\$ td query -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE mf_sgd_model_f20
SELECT
idx,
array_avg(u_rank) as Pu,
array_avg(m_rank) as Qi,
avg(u_bias) as Bu,
avg(m_bias) as Bi
FROM (
SELECT train_mf_sgd(userid, movieid, rating, '-factor 20 -mu 3.52576146875 -iter 50') AS (idx, u_rank, m_rank, u_bias, m_bias)
FROM training_mf
) t
GROUP BY idx
"
```

The signature of `train_mf_sgd` is `train_mf_sgd(int userid, int itemid, numeric rating [, string options])`.

### Training options

You can get information about hyperparameter for training using `-help` option as follows:

```\$ td query -w --type hive -d movielens20m "
SELECT train_mf_sgd(userid, movieid, rating, '-help') AS (idx, u_rank, m_rank, u_bias, m_bias)
FROM training_mf
"

usage: MatrixFactorizationSGDUDTF [-cv_rate <arg>] [-disable_bias]
[-disable_cv] [-eta <arg>] [-eta0 <arg>] [-help] [-iter <arg>] [-k
<arg>] [-maxval <arg>] [-min_init_stddev <arg>] [-mu <arg>]
[-power_t <arg>] [-r <arg>] [-rankinit <arg>] [-t <arg>]
[-update_mean]
-cv_rate,--convergence_rate <arg>   Threshold to determine convergence
[default: 0.005]
-disable_bias,--no_bias             Turn off bias clause
-disable_cv,--disable_cvtest        Whether to disable convergence check
[default: enabled]
-eta <arg>                          The initial learning rate [default:
0.001]
-eta0 <arg>                         The initial learning rate [default
0.2]
-help                               Show function help
-iter,--iterations <arg>            The number of iterations [default: 1]
-k,--factor <arg>                   The number of latent factor [default:
10]
-maxval,--max_init_value <arg>      The maximum initial value in the rank
matrix [default: 1.0]
-min_init_stddev <arg>              The minimum standard deviation of
initial rank matrix [default: 0.1]
-mu,--mean_rating <arg>             The mean rating [default: 0.0]
-power_t <arg>                      The exponent for inverse scaling
learning rate [default 0.1]
-r,--lambda <arg>                   The regularization factor [default:
0.03]
-rankinit <arg>                     Initialization strategy of rank
matrix [random, gaussian] (default:
random)
-t,--total_steps <arg>              The total number of training examples
-update_mean,--update_mu            Whether update (and return) the mean
rating or not
```

## Predict

```\$ td table:create movielens20m sgd_predict_f20

\$ td query -x --type hive -d movielens20m "
INSERT OVERWRITE TABLE sgd_predict_f20
SELECT
t2.actual,
mf_predict(t2.Pu, p2.Qi, t2.Bu, p2.Bi, 3.52560165625) as predicted
FROM (
SELECT
t1.userid,
t1.movieid,
t1.rating as actual,
p1.Pu,
p1.Bu
FROM
testing_mf t1 LEFT OUTER JOIN mf_sgd_model_f20 p1
ON (t1.userid = p1.idx)
) t2
LEFT OUTER JOIN mf_sgd_model_f20 p2
ON (t2.movieid = p2.idx)
"
```

## Evaluate (computes MAE and RMSE)

```\$ td query -w --type hive -d movielens20m "
SELECT
MAE(predicted, actual) AS mae,
RMSE(predicted, actual) AS rmse
FROM sgd_predict_f20
"
```
mae rmse
0.6123907679836259 0.8027164481776642

NOTE: If you are a legacy customer, you might be running an older version of Hive that does not support this feature. Please see this article about using the new version.