I am trying to implement something similar to the item2vec approach in https://arxiv.org/pdf/1603.04259.pdf using the gensim library, but I am having trouble matching the quality of my Collaborative Filtering results.
I have two models: one built on Apache Spark and the other using gensim Word2Vec, both trained on the GroupLens/MovieLens 20-million-ratings dataset. My Apache Spark model is hosted on AWS at http://sparkmovierecommender.us-east-1.elasticbeanstalk.com and I am running the gensim model locally. When I compare the results, the CF model is superior about 9 times out of 10 (as in the example below, where its results are more similar to the searched movie, with an affinity towards Marvel movies).
For example, if I search for the movie Thor, I get the results below:
Gensim:
CF:
Below is my model configuration. So far I have tried playing with the window, min_count and size parameters, but without much improvement.
import gensim

word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=30)
word2vec_model.build_vocab(movie_list)  # vocabulary must be built before train() when no corpus is passed to the constructor
word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=10)
Any help in this regard is appreciated.
You don't mention which Collaborative Filtering algorithm you're trying, but maybe it's simply better than Word2Vec for this purpose. (Word2Vec is not doing awfully; why do you expect it to be better?)
Alternate meta-parameters might do better. For example, window is the maximum distance between tokens that can affect each other, but the effective window used for each target-token's training is randomly chosen from 1 to window, as a way to give nearby tokens more weight. Thus when some training-texts are much longer than the window (as in your example row), some of the correlations will be ignored (or underweighted). Unless ordering is very significant, a giant window (MAX_INT?) might do better, or even a related method where ordering is irrelevant, such as Doc2Vec in pure PV-DBOW mode (dm=0) with every token used as a doc-tag; a sketch of that follows below.
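For instance, here is a minimal sketch of that order-free Doc2Vec variant, assuming movie_list is the same list of per-user movie-token lists you feed Word2Vec, and the gensim 4.x API (on 3.x, use model.docvecs instead of model.dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Use every token of a user's list as a doc-tag for that same list, so each
# movie's vector is trained against all co-occurring movies regardless of
# their position in the list (pure PV-DBOW, no word-window at all).
docs = [TaggedDocument(words=movies, tags=movies) for movies in movie_list]

d2v_model = Doc2Vec(docs, dm=0, vector_size=100, min_count=50, epochs=10, seed=1)
print(d2v_model.dv.most_similar('Thor'))  # assumes 'Thor' is one of your movie tokens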
Depending on how much data you have, your size might be too large or too small. A different min_count, negative count, greater iter/epochs, or sample level might work much better. (And perhaps even the things you've already tinkered with would only help after other changes are in place.) A simple parameter sweep, as sketched below, is the usual way to find out.
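As an illustration only, a small grid search over those meta-parameters might look like the following. Parameter names follow the gensim 3.x API used in your question (gensim 4.x renames size to vector_size and iter to epochs), and score_model is a hypothetical evaluation helper you'd write yourself, e.g. overlap with held-out co-watch pairs:

import gensim

# Try a few values per meta-parameter; evaluate every model the same way.
for size in (50, 100, 200):
    for negative in (5, 15):
        for sample in (0, 1e-4, 1e-5):
            for window in (30, 10000):  # a huge window approximates order-irrelevance
                model = gensim.models.Word2Vec(
                    movie_list, seed=1, size=size, min_count=50, window=window,
                    negative=negative, sample=sample, iter=20)
                score_model(model)  # hypothetical helper: your own quality metric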