
'similar_by_word' did not improve over iterations


I'm using Gensim to train a skip-gram word2vec model. The dataset has 1 million sentences, but the vocabulary size is only 200. I would like to track the model's accuracy over iterations, so I used model.wv.similar_by_word in a callback to check the scores, but the returned values never changed across iterations.

iter was set to 100. I also tried changing the values of window and size, but that had no effect.

The model was initialized with callbacks:

Word2Vec(self.train_corpus, workers=multiprocessing.cpu_count(), compute_loss=True, callbacks=[A_CallBack], **word2vec_params)

In the class A_CallBack, I have something like this:

def on_epoch_end(self, model):
    word, score = model.wv.similar_by_word(word='target_word', topn=1)[0]
    print(word, score)

The word and score were printed out for every epoch, but the values have never changed.

I was expecting these values to change over iterations — shouldn't they?

I'm new to machine learning and word2vec. Thanks a lot for the help.


Solution

  • The various gensim similarity functions are optimized via the pre-calculation of unit-length normed vectors, and that pre-calculation is cached in a way that doesn't expect further training to happen.

    As a result, when you first check similarities mid-training, as you've done with your callback code, the cache gets filled with the model's early state and is not refreshed after later training. There's an open bug (as of gensim 3.8.1, November 2019) to fix this behavior; in the meantime, you can either:

    • refrain from checking similarity-operations until after training is done, or
    • manually clear some of the caches after you've done more training. For a plain Word2Vec model, it should be enough to do: model.wv.vectors_norm = None. (Some other models require extra steps, see the bug discussion for more details.)