Tags: word2vec, gensim, recommendation-engine

Gensim: Word2Vec Recommender Accuracy Improvement


I am trying to implement something similar to https://arxiv.org/pdf/1603.04259.pdf using the excellent gensim library; however, I am having trouble improving the quality of the results compared to Collaborative Filtering.

I have two models, one built on Apache Spark and the other using gensim Word2Vec, both trained on the GroupLens 20-million-ratings dataset. My Apache Spark model is hosted on AWS at http://sparkmovierecommender.us-east-1.elasticbeanstalk.com and I am running the gensim model locally. However, when I compare the results, the CF model produces better results 9 out of 10 times (as in the example below, where its results are more similar to the searched movie, with an affinity towards Marvel movies).

For example, if I search for the movie Thor (2011), I get the results below:

Gensim

  • Captain America: The First Avenger (2011)
  • X-Men: First Class (2011)
  • Rise of the Planet of the Apes (2011)
  • Iron Man 2 (2010)
  • X-Men Origins: Wolverine (2009)
  • Green Lantern (2011)
  • Super 8 (2011)
  • Tron: Legacy (2010)
  • Transformers: Dark of the Moon (2011)

CF

  • Captain America: The First Avenger
  • Iron Man 2
  • Thor: The Dark World
  • Iron Man
  • The Avengers
  • X-Men: First Class
  • Iron Man 3
  • Star Trek
  • Captain America: The Winter Soldier

Below is my model configuration. So far I have tried playing with the window, min_count, and size parameters, but without much improvement.

import gensim

# movie_list is presumably one "sentence" per user: a list of that user's movie tokens
word2vec_model = gensim.models.Word2Vec(
    seed=1,
    size=100,
    min_count=50,
    window=30)

# with no corpus passed to the constructor, the vocabulary must be built before training
word2vec_model.build_vocab(movie_list)
word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=10)
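
(For reference, the similarity lists above would come from a nearest-neighbour lookup on the trained vectors, roughly as sketched below. This assumes gensim 1.0+, where vectors are exposed via .wv; 'Thor (2011)' is just a placeholder for whatever movie-token convention the training sentences actually use.)

# hypothetical query: 'Thor (2011)' is a placeholder token, not necessarily the real key
for movie, score in word2vec_model.wv.most_similar('Thor (2011)', topn=10):
    print(movie, score)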

Any help in this regard is appreciated.


Solution

  • You don't mention which Collaborative Filtering algorithm you're using, but maybe it's simply better than Word2Vec for this purpose. (Word2Vec isn't doing badly; why do you expect it to be better?)

    Alternate meta-parameters might do better.

    For example, window is the maximum distance between tokens that can influence each other, but the effective window used for each target token during training is randomly chosen from 1 to window, as a way of giving nearby tokens more weight. Thus when some training texts are much larger than the window (as in your example rows), some of the correlations will be ignored (or underweighted). Unless ordering is very significant, a giant window (MAX_INT?) might do better, or even a related method where ordering is irrelevant, such as Doc2Vec in pure PV-DBOW mode (dm=0) with every token used as a doc-tag; see the first sketch below.

    Depending on how much data you have, your size might be too large or too small. A different min_count, negative count, a greater 'iter'/'epochs', or a different sample level might work much better; see the second sketch below for one illustrative alternate configuration. (And perhaps even the things you've already tinkered with would only help after other changes are in place.)
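
    First sketch: a minimal, order-free Doc2Vec variant of the idea above, with every movie ID also used as a doc-tag so that only co-occurrence (not position) matters. Parameter names follow the current gensim 4.x API (vector_size, dv); on older versions they would be size and docvecs. The movie_list variable and the 'Thor (2011)' token are placeholders taken from the question, assuming the training "sentences" are per-user lists of movie tokens.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # each user's movie list becomes a document whose tags are the movie tokens themselves,
    # so every movie gets a vector trained against all co-listed movies, without window effects
    docs = [TaggedDocument(words=movies, tags=movies) for movies in movie_list]

    d2v_model = Doc2Vec(
        documents=docs,
        dm=0,              # pure PV-DBOW: no word ordering involved
        vector_size=100,
        min_count=50,
        epochs=10,
        seed=1)

    # similar movies come from the trained doc-tag vectors
    print(d2v_model.dv.most_similar('Thor (2011)', topn=10))  # placeholder token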
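
    Second sketch: the kind of alternate Word2Vec meta-parameters discussed above (a giant window, an explicit negative count, aggressive sample downsampling, and more epochs), keeping the question's gensim 3.x-style parameter names (size rather than vector_size). The specific values are illustrative guesses to be tuned against your own evaluation, not recommendations.

    # illustrative alternate configuration -- every value here is a guess to tune, not a recommendation
    word2vec_model = gensim.models.Word2Vec(
        seed=1,
        size=100,        # try smaller or larger depending on data volume
        min_count=5,     # keep rarer movies in the vocabulary
        window=1000,     # effectively "the whole list" for most users
        negative=10,     # more negative samples
        sample=1e-4)     # downsample the most frequent movies

    word2vec_model.build_vocab(movie_list)
    word2vec_model.train(movie_list, total_examples=len(movie_list), epochs=20)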