I have a dataset with documents separated by year, and my objective is to train an embedding model on each year's data such that the same word appearing in different years gets similar vector representations. For example, for the word 'compute', its vector in year 1 is
[0.22, 0.33, 0.20]
and in year 2 it's something around:
[0.20, 0.35, 0.18]
Is there a way to accomplish this? For example, train the year-2 model with a mix of initial values (if a word was already trained in year 1, start from and update its existing vector) and randomness (if the word is new to the corpus).
I think the easiest solution is to save the model after training on the first dataset, then load it and continue training on the second dataset. This way the embeddings shouldn't drift far from the saved state (unless your datasets are very different).
It would also make sense to build a single vocabulary from all documents: vocabulary words that aren't present in a particular year's corpus will keep some random representation, but it will still be a working word2vec model.
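A minimal sketch of that shared-vocabulary idea, assuming gensim 4.x and hypothetical sentences_year1 / sentences_year2 iterables of tokenized documents:

    from gensim.models import Word2Vec

    # Hypothetical inputs: lists of tokenized documents per year,
    # e.g. [['compute', 'power', ...], ...]
    all_sentences = sentences_year1 + sentences_year2

    # Build one vocabulary covering every year, then train on year 1 only.
    model = Word2Vec(vector_size=100, window=5, min_count=5, workers=4)
    model.build_vocab(all_sentences)
    model.train(sentences_year1, total_examples=len(sentences_year1), epochs=model.epochs)
    model.save("year1.model")

Words that only occur in year 2 simply stay at their random initialization until you train on the year-2 data.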
Example from the documentation:
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)  # the parameter is `size` in gensim < 4.0
>>> model.save(fname)
>>> model = Word2Vec.load(fname)  # you can now continue training with the loaded model
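Continuing from there, a hedged sketch for the second year (still assuming gensim 4.x and the hypothetical sentences_year2 from above): build_vocab(update=True) adds year-2-only words with random initial vectors, which covers the second part of your question.

    from gensim.models import Word2Vec

    model = Word2Vec.load("year1.model")

    # Add words that only appear in year 2 (skip this call if the full
    # vocabulary was already built from all years up front). Existing words
    # keep their trained vectors; genuinely new words start from random values.
    model.build_vocab(sentences_year2, update=True)

    model.train(sentences_year2, total_examples=len(sentences_year2), epochs=model.epochs)
    model.save("year2.model")

If you want a per-year snapshot, save a copy of the model (or just model.wv) after each year's training; the vectors for shared words should stay close across snapshots, as in your 'compute' example.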