I'm using Gensim to train a skip-gram word2vec model. The dataset has 1 million sentences, but the vocabulary size is only 200. I'd like to track the model's accuracy over iterations, so I call model.wv.similar_by_word in a callback function to inspect the scores, but the returned values never change between iterations.
The iter parameter was set to 100. I also tried changing the values of window and size, but that had no effect either.
The model was initialized with callbacks:

```python
Word2Vec(self.train_corpus,
         workers=multiprocessing.cpu_count(),
         compute_loss=True,
         callbacks=[A_CallBack()],
         **word2vec_params)
```
In the class A_CallBack
, I have something like this:
```python
def on_epoch_end(self, model):
    word, score = model.wv.similar_by_word(word='target_word', topn=1)[0]
    print(word, score)
```
The word and score are printed out for every epoch, but the values never change. I was expecting them to be updated as training progresses, which seems like it should happen. I'm new to machine learning and word2vec. Thanks a lot for the help.
The various gensim
similarity functions are optimized via the pre-calculation of unit-length normed vectors, and that pre-calculation is cached in a way that doesn't expect further training to happen.
As a result, when you first check similarities mid-training, as you've done in your callback code, the cache gets filled with the model's early state and is never refreshed by later training. There's a pending bug (as of gensim-3.8.1 in November 2019) to fix this behavior; in the meantime, you can either:

- avoid checking similarities until training is complete; or
- discard the cached normed vectors before each check, so they are freshly recalculated from the current weights.

For a Word2Vec model, it should be enough to do: model.wv.vectors_norm = None. (Some other models require extra steps; see the bug discussion for more details.)
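The caching pitfall, and why resetting the cache fixes it, can be sketched without gensim. Below is a toy stand-in (the class name TinyVectors and its internals are made up for illustration; this is not gensim's implementation) that caches unit-normed vectors on first use, the same general pattern as KeyedVectors:

```python
import math

class TinyVectors:
    """Toy stand-in (not gensim) for KeyedVectors: similarity queries use a
    cache of unit-normed vectors that is filled once and never refreshed."""

    def __init__(self, vectors):
        self.vectors = vectors       # word -> raw vector (list of floats)
        self.vectors_norm = None     # cache, analogous to gensim's .vectors_norm

    def init_sims(self):
        # Fill the cache only if it's empty; later vector updates are ignored.
        if self.vectors_norm is None:
            self.vectors_norm = {
                w: [x / math.sqrt(sum(c * c for c in v)) for x in v]
                for w, v in self.vectors.items()
            }

    def similarity(self, a, b):
        self.init_sims()  # serves the stale cache if one exists
        return sum(x * y for x, y in zip(self.vectors_norm[a],
                                         self.vectors_norm[b]))

kv = TinyVectors({'cat': [1.0, 0.0], 'dog': [1.0, 0.1]})
before = kv.similarity('cat', 'dog')   # first query fills the cache
kv.vectors['dog'] = [0.0, 1.0]         # "training" moves a vector...
stale = kv.similarity('cat', 'dog')    # ...but the cached answer is unchanged
kv.vectors_norm = None                 # the workaround: drop the cache
fresh = kv.similarity('cat', 'dog')    # recomputed from the current vectors
```

In the real callback, the analogous line is setting model.wv.vectors_norm = None right before calling model.wv.similar_by_word, so each epoch's check normalizes the latest trained vectors.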