Tags: machine-learning, vowpalwabbit, matrix-factorization

Vowpal Wabbit Matrix Factorization on one label


What I'm after is a recommender system for the web, something like "related products". Based on the items a user has bought, I want to find related items from what other users have bought. I've followed the MovieLens tutorial (https://github.com/JohnLangford/vowpal_wabbit/wiki/Matrix-factorization-example) for building a recommender system.

In that example the users give the movies a score (1-5), and the model can then predict the score a user will give a specific item.

My data, on the other hand, only records what a user likes. I don't know what they dislike or how much they like something. So I've tried using 1 as the label on all my entries, but that only gives me a model that returns 1 for every prediction.

Any ideas on how I can structure my data so that I get a prediction, between 0 and 1, of how likely the user is to like an item?

Example data:

1.0 |user 1 |item 1
1.0 |user 1 |item 2
1.0 |user 2 |item 2
1.0 |user 2 |item 3
1.0 |user 3 |item 1
1.0 |user 3 |item 3

Training command:

cat test.vw | vw /dev/stdin -b 18 -q ui --rank 10 --l2 0.001 --learning_rate 0.015 --passes 20 --decay_learning_rate 0.97 --power_t 0 -f test.reg --cache_file test.cache

Solution

  • Short answer to the question:

    To get a prediction resembling "probabilities" you could use --loss_function logistic --link logistic. Be aware that in this single-label setting your probabilities risk tending to 1.0 quickly (i.e. they become meaningless); a sketch of a full command appears after these notes.

    Additional notes:

    • Working with a single label is problematic: with only positive examples there is nothing for the learner to separate, so eventually it will peg all predictions to 1.0. To counter that, use --noconstant, apply strong regularization, decrease the learning rate, avoid multiple passes, etc. (in other words: anything that avoids over-fitting to the single label)
    • Even better: add negative examples for user/item pairs where the user hasn't bought or clicked. Such pairs should be plentiful, and including them will make your model much more robust and meaningful (see the data sketch below).
    • There's a better implementation of matrix factorization in vw (much faster and lighter on I/O for big models): check the --lrq option and the full demo under demo/movielens in the source tree.
    • You should pass the training set directly to vw as an argument, rather than piping it, to avoid a "useless use of cat" (the command sketch below does this).
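
To make the negative-examples note concrete, here is a sketch of what the training data might look like. With --loss_function logistic, vw expects labels of -1 and 1, so bought pairs are labeled 1 and a sample of unbought pairs is labeled -1 (which unbought pairs to include is up to you; the ones below are purely illustrative):

1 |user 1 |item 1
1 |user 1 |item 2
-1 |user 1 |item 3
1 |user 2 |item 2
1 |user 2 |item 3
-1 |user 2 |item 1
1 |user 3 |item 1
1 |user 3 |item 3
-1 |user 3 |item 2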
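
And a sketch of the training command with the notes above applied: the data file is passed directly instead of through cat; --loss_function logistic --link logistic yields probability-like predictions; --noconstant and a single pass (no --passes, so no cache file needed) guard against pegging at 1.0; and --lrq ui10 replaces -q ui --rank 10 (a rank-10 low-rank quadratic interaction between the u and i namespaces). The remaining values are carried over from the question, not tuned:

vw test.vw -b 18 --lrq ui10 --loss_function logistic --link logistic --noconstant --l2 0.001 --learning_rate 0.015 --power_t 0 -f test.reg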