I'm training a Word2Vec model with Gensim on Twitter data. The reported loss grows with every epoch; the first epoch gives the lowest loss. Why is that? Code is shared below:
```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

loss_list = []

class callback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_list.append(loss)
        print('Loss after epoch {}: {}'.format(self.epoch, loss))
        self.epoch = self.epoch + 1

# df['tweet_text'] is assumed to hold pre-tokenized tweets (lists of tokens)
model = Word2Vec(df['tweet_text'], vector_size=300, window=10, epochs=30,
                 hs=0, negative=1, compute_loss=True, callbacks=[callback()])

embedding_size = model.wv.vectors.shape[1]
print("embedding size--->", embedding_size)
vocab = model.wv.index_to_key

print("minimum loss {} at epoch {}".format(min(loss_list), loss_list.index(min(loss_list))))
```
The output is:
```
Loss after epoch 0: 527066.375
Loss after epoch 1: 1038087.0625
Loss after epoch 2: 1510719.75
Loss after epoch 3: 1936163.875
Loss after epoch 4: 2364015.5
Loss after epoch 5: 2779299.75
Loss after epoch 6: 3183956.25
Loss after epoch 7: 3570054.5
Loss after epoch 8: 3966524.75
Loss after epoch 9: 4335994.5
Loss after epoch 10: 4706316.0
Loss after epoch 11: 5046213.0
Loss after epoch 12: 5410604.5
Loss after epoch 13: 5754962.0
Loss after epoch 14: 6080469.0
Loss after epoch 15: 6428622.5
Loss after epoch 16: 6771707.0
Loss after epoch 17: 7105302.0
Loss after epoch 18: 7400089.0
Loss after epoch 19: 7732032.0
Loss after epoch 20: 8059942.5
Loss after epoch 21: 8408386.0
Loss after epoch 22: 8685176.0
Loss after epoch 23: 8959723.0
Loss after epoch 24: 9242788.0
Loss after epoch 25: 9506676.0
Loss after epoch 26: 9752588.0
Loss after epoch 27: 10013168.0
Loss after epoch 28: 10288152.0
Loss after epoch 29: 10550915.0
embedding size---> 300
minimum loss 527066.375 at epoch 0
```
Unfortunately, the code that totals up loss for reporting in the Gensim `Word2Vec` model has a number of known bugs and deviations from reasonable user expectations. You can see an overview of the problems, with links to a number of more-specific bugs, in the project's bug-tracking issue [#2617][1].
Among other problems, the loss reported by default is a running tally across all epochs: to get per-epoch loss, you'd have to do extra comparisons of successive readings, or reset the tally to 0.0 yourself. And insufficient precision in the running-tally variables means other inaccuracies that become noticeable over many epochs or large training runs.
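For instance, here's a minimal sketch of the comparison approach, remembering the previous cumulative reading and printing the difference. `CallbackAny2Vec` and `get_latest_training_loss()` are real Gensim APIs; the class name is just illustrative:

```python
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossDiff(CallbackAny2Vec):
    """Derive per-epoch loss by differencing Gensim's cumulative tally."""

    def __init__(self):
        self.epoch = 0
        self.previous = 0.0

    def on_epoch_end(self, model):
        cumulative = model.get_latest_training_loss()  # running total across epochs
        print('Loss in epoch {}: {}'.format(self.epoch, cumulative - self.previous))
        self.previous = cumulative
        self.epoch += 1
```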
These bugs don't affect the effectiveness of training, only the accuracy of the `get_latest_training_loss()` reporting.
Manually resetting the internal tally to `0.0` at the start of each epoch, from your own callback, may improve the reporting enough for your purposes, if your jobs aren't especially large.
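A sketch of that reset approach is below. Note that `running_training_loss` is an *internal* model attribute (the one `get_latest_training_loss()` reads in recent Gensim versions), so this workaround may need adjusting across Gensim versions:

```python
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossReset(CallbackAny2Vec):
    """Report per-epoch loss by zeroing Gensim's internal tally each epoch."""

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        print('Loss in epoch {}: {}'.format(self.epoch, model.get_latest_training_loss()))
        # Zero the internal attribute that get_latest_training_loss() reads,
        # so the next epoch's reading is per-epoch rather than cumulative.
        model.running_training_loss = 0.0
        self.epoch += 1
```

You'd pass this as `callbacks=[EpochLossReset()]` to the `Word2Vec` constructor, the same way as your existing callback.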
However, other things to note about your apparent setup:
- Keep in mind that a full epoch's loss can hint at whether more SGD training will be beneficial to the model's internal training objective, but it is not a reliable indicator of the quality of the final word-vectors for other downstream uses. A model with more loss might give better vectors; a model with less loss might (through overfitting) give word-vectors that are less generally useful for typical purposes. So don't rely on loss as a guide to other meta-optimization, only for the choice of `epochs`/`alpha` or potential early stopping.
- `min_count=1` is essentially always a mistake with `Word2Vec`, giving you not just bad vectors for the words that appear only once (or a few times), but also making the other word-vectors, for more-common words, worse than they'd be with a more sensible `min_count` choice. This is especially the case if you truly have enough data to justify large `vector_size=300` vectors.
- The atypical parameter `negative=1` is also almost certainly sub-optimal, and `window=10` is another deviation from the defaults that will usually only make sense if you've got some repeatable, quantitative quality evaluation that can assure you it's an improvement over the default. (A hedged baseline sketch follows this list.)
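If it helps as a starting point, here's a sketch of a call closer to Gensim's defaults. The specific values are illustrative assumptions, not known-optimal settings for your data, and `df['tweet_text']` is assumed to already contain tokenized tweets:

```python
from gensim.models import Word2Vec

# Illustrative baseline near Gensim's defaults; change one knob at a time,
# and only keep a change if a repeatable downstream evaluation improves.
model = Word2Vec(
    df['tweet_text'],   # assumed: lists of tokens, one list per tweet
    vector_size=300,    # only justified if the corpus is large enough
    window=5,           # Gensim default
    min_count=5,        # Gensim default: discards very rare words
    negative=5,         # Gensim default number of negative samples
    epochs=30,
    compute_loss=True,
)
```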