I want to train on part of the corpus first, and then, based on those embeddings, continue training on the whole corpus. Can I achieve this with gensim skip-gram?
I haven't found an API that lets me pass in initial embeddings.
What I want is something like:
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"],
["cat2", "say2", "meow"], ["dog2", "say", "woof"]]
model = Word2Vec(sentences[:2], min_count=1)
X = ...  # construct a new set of initial embeddings somehow
model = Word2Vec(sentences, min_count=1, initial_embedding=X)
I'm not sure why you'd want to do this: if you have the whole corpus, and can train on the whole corpus, you're likely to get the best results from whole-corpus training.
And, to the extent there's anything missing from the 2nd corpus, training on it will tend to pull the vectors for words still being trained away from the words no longer present – causing the comparability of vectors within the model to decay. (It's only the interleaved tug-of-war between examples including all words that nudges them into positions that are meaningfully related to each other.)
But, keeping that caveat in mind: you can continue to train() a model with new data. That is:
# initialize & do all default training
model = Word2Vec(sentences[:2], min_count=1)
# now train again even more with a slightly different mix
model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
Note that in such a case the model's discovered vocabulary is based only on the original initialization. If there are words that appear only in the later sentences[2:], then when those sentences are presented to a model that didn't see those words during its initialization, they will be ignored – and never get vectors. (Using your tiny example corpus this way, words like 'cat2' and 'say2' won't get vectors. Again, you really want to train on the largest corpus – or at least use the largest corpus, with a superset of words, 1st.)
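If you do need those later-appearing words to get vectors, one option is to expand the vocabulary before the extra training pass. A minimal sketch, assuming a recent gensim where build_vocab() supports update=True and KeyedVectors supports membership checks:
from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"],
             ["cat2", "say2", "meow"], ["dog2", "say", "woof"]]

# vocabulary is discovered only from the first 2 sentences
model = Word2Vec(sentences[:2], min_count=1)
print("cat2" in model.wv)  # False – never seen at init, so no vector

# expand the vocabulary with the full corpus, then train again
model.build_vocab(sentences, update=True)
model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
print("cat2" in model.wv)  # True – now has a (lightly-trained) vector
Even then, the late-added words will have received less total training than the original ones, so treat their vectors with some skepticism.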
Also, a warning will be logged, because the 2nd training will again start the internal alpha learning-rate at its larger starting value, then gradually decrease it to the final min_alpha value. Yo-yo'ing the learning rate like this isn't standard SGD, and usually indicates a user error. But it might be tolerable depending on your goals – you just need to be aware that when you're doing unusual training sequences like this, you're off in experimental/advanced land and have to deal with possible side-effects via your own understanding.
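If you go ahead anyway, one way to avoid the full alpha reset is to pass explicit, smaller learning-rate bounds to the extra train() call – a minimal sketch, assuming your gensim version's train() accepts the start_alpha/end_alpha arguments (the specific values here are illustrative, not recommendations):
# continue training with an explicitly smaller, still-decaying learning rate,
# so alpha doesn't jump back up to its default 0.025 starting value
model.train(sentences, total_examples=len(sentences), epochs=model.epochs,
            start_alpha=0.005, end_alpha=0.0001)
Choosing a start_alpha below the original run's final value keeps the effective learning rate roughly monotonically decreasing across the two passes.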