What I'm after is a recommender system for the web, something like "related products": based on the items a user has bought, I want to find related items based on what other users have bought. I've followed the MovieLens tutorial (https://github.com/JohnLangford/vowpal_wabbit/wiki/Matrix-factorization-example) for building a recommender system.
In the example above the users gave the movies a score (1-5). The model can then predict the score a user will give a specific item.
My data, on the other hand, only knows what the user likes. I don't know what they dislike or how much they like something. So I've tried setting the value to 1 on all my entries, but that only gives me a model that returns 1 on every prediction.
Any ideas on how I can structure my data so that I get a prediction, between 0 and 1, of how likely the user is to like an item?
Example data:
1.0 |user 1 |item 1
1.0 |user 1 |item 2
1.0 |user 2 |item 2
1.0 |user 2 |item 3
1.0 |user 3 |item 1
1.0 |user 3 |item 3
Training command:
cat test.vw | vw /dev/stdin -b 18 -q ui --rank 10 --l2 0.001 --learning_rate 0.015 --passes 20 --decay_learning_rate 0.97 --power_t 0 -f test.reg --cache_file test.cache
To get a prediction resembling "probabilities" you could use --loss_function logistic --link logistic. Be aware that in this single-label setting your probabilities risk tending to 1.0 quickly (i.e. become meaningless). To counter that: use --noconstant, use strong regularization, decrease the learning rate, avoid multiple passes, etc. (IOW: anything that avoids over-fitting to the single label).
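For example, a minimal sketch of an adjusted training command along those lines (the --l2 and --learning_rate values are only illustrative, not tuned, and I'm assuming --rank combines cleanly with the logistic loss and link; if it doesn't, see the --lrq variant further down):

vw test.vw -b 18 -q ui --rank 10 --loss_function logistic --link logistic --noconstant --l2 0.01 --learning_rate 0.005 --power_t 0 -f test.reg

Note that with --loss_function logistic vw expects labels in {-1, 1}, so the existing 1.0 labels are fine as positives; without negatives or the counter-measures above the model will still drift toward predicting 1.0 for everything.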
On a side note, feed the data file directly to vw rather than piping it (much faster and lighter on IO for big models), which also avoids a "Useless use of cat". Also check the --lrq option and the full demo under demo/movielens in the source tree.
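As a rough sketch of the --lrq route (the namespace letters u and i and the rank of 10 are taken from the data and command above; the regularization and learning-rate values are again untuned guesses):

vw test.vw -b 18 --lrq ui10 --loss_function logistic --link logistic --noconstant --l2 0.01 --learning_rate 0.005 -f test.reg

and then, to score user/item pairs against the saved model, something like:

vw -t -i test.reg -p predictions.txt test.vw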