Tags: nlp, gensim, word2vec, word-embedding

Gensim Word2Vec produces different most_similar results during the final epoch than at the end of training


I'm using gensim's Word2Vec for a recommendation-like task, with part of my evaluation relying on callbacks and the most_similar() method. However, I am noticing a huge disparity between the final few epoch callbacks and the results immediately post-training. In fact, the last epoch callback often appears worthless, while the post-training result is as good as could be desired.

My during-training tracking of most-similar entries uses gensim's CallbackAny2Vec class. It follows the docs' example fairly directly and roughly looks like:

from gensim.models.callbacks import CallbackAny2Vec

class EpochTracker(CallbackAny2Vec):

  def __init__(self):
    self.epoch = 0

  def on_epoch_begin(self, model):
    print("Epoch #{} start".format(self.epoch))

  def on_epoch_end(self, model):
    
    print('Some diagnostics')
    # multiple terms are checked in practice; one shown here for brevity
    e = model.wv  # KeyedVectors of the in-progress model
    print(e.most_similar(positive=['some term'])[0:3])  # top 3 neighbors of 'some term'

    print("Epoch #{} end".format(self.epoch))
    self.epoch += 1
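
For reference, I wire the callback into training roughly like this (the corpus and hyperparameters are placeholders; parameter names follow gensim 4.x):

from gensim.models import Word2Vec

# Placeholder corpus; the real input is a list of tokenized entries.
corpus = [["some term", "another term"], ["some term", "third term"]]

e_model = Word2Vec(
    sentences=corpus,
    vector_size=100,             # 'size' in gensim < 4.0
    min_count=1,                 # keep the rare placeholder tokens
    epochs=5,                    # 'iter' in gensim < 4.0
    callbacks=[EpochTracker()],  # fires on_epoch_begin/on_epoch_end each pass
)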

As the epochs progress, the most_similar() results given by the callbacks do not seem to indicate an advancement of learning and appear erratic. In fact, the callback from the first epoch often shows the best result.

Counterintuitively, I also have an additional process built into the callback that does indicate gradual learning. Following the similarity step, I take the current model's vectors and evaluate them against a downstream task. In brief, this process is a sklearn GridSearchCV logistic regression check against some known labels, sketched below.
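
Roughly, that check looks like the following (the terms, labels, and parameter grid here are illustrative stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def downstream_check(model, terms, labels):
    # Build features from the current raw (un-normalized) vectors.
    X = np.array([model.wv[t] for t in terms])
    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=3,
    )
    grid.fit(X, labels)  # labels: known classes for each term
    return grid.best_score_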

I find that the last on_epoch_end callback often appears to be garbage – or perhaps this is some multi-threading shenanigans. However, if I try the similarity call again directly after training the model:

e = e_model.wv # e_model was the variable assignment of the model overall
print(e.most_similar(positive=['some term'])[0:3])

I tend to get beautiful results that agree with the downstream evaluation task also used in the callbacks, or that are at least vastly different from those of the final epoch-end callback.

I suspect I am missing something painfully apparent, or that most_similar() has unusual behavior in epoch-end callbacks. Is this a known issue, or is my approach flawed?


Solution

  • What version of Gensim are you using?

    In older versions – pre-4.0, if I remember correctly? – the most_similar() operation relies on a cached, pre-computed set of unit-normalized word-vectors that in some cases is frozen the first time you try a most_similar().

    Thus, incremental updates to the vectors won't be reflected in results unless something happens to flush that cache - which happens at the end of training. But since mid-training checks weren't an originally-envisioned usage, more-frequent flushing doesn't happen unless forced (a sketch of forcing it follows below).

    I think if you make sure to use the latest Gensim, the problem may go away - or reviewing this older project issue may provide ideas if you're stuck on an older version: https://github.com/RaRe-Technologies/gensim/issues/2260

    (Your other mid-training learning process – if it's accessing the non-normalized per-word vectors directly, rather than via most_similar() – is likely succeeding because it's skipping that normed-vector cache.)
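
    For a pre-4.0 Gensim, one workaround is to force a recomputation of that cache inside the callback before querying. Here is a minimal sketch, assuming the gensim 3.x attribute and method names (vectors_norm and init_sims()):

      def on_epoch_end(self, model):
          # Pre-4.0 workaround: discard the cached unit-normed vectors so
          # most_similar() recomputes them from the freshly-updated weights.
          model.wv.vectors_norm = None  # called 'syn0norm' in far older versions
          model.wv.init_sims()          # rebuild the normed-vector cache
          print(model.wv.most_similar(positive=['some term'])[0:3])

    (On 4.0+, the analogous cache is KeyedVectors.norms, refreshable with fill_norms(force=True), though simply upgrading may make any manual flush unnecessary.)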