python · machine-learning · gensim · word2vec

How can I use a pretrained embedding with gensim's skip-gram model?


I want to train on part of the corpus first and then, starting from those embeddings, train on the whole corpus. Can I achieve this with gensim's skip-gram model?

I haven't found an API that accepts initial embeddings.

What I want is something like:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"],
             ["cat2", "say2", "meow"], ["dog2", "say", "woof"]]
model = Word2Vec(sentences[:2], min_count=1)
X = ...  # construct the initial embeddings somehow
model = Word2Vec(sentences, min_count=1, initial_embedding=X)

Solution

  • I'm not sure why you'd want to do this: if you have the whole corpus, and can train on the whole corpus, you're likely to get the best results from whole-corpus training.

    And, to the extent anything is missing from the 2nd corpus, that 2nd round of training will tend to pull the vectors of words still being trained away from words no longer present – causing the comparability of vectors within the model to decay. (It's only the interleaved tug-of-war between examples that include all the words that nudges them into positions meaningfully related to each other.)

    But, keeping that caveat in mind: you can continue to train() a model with new data. That is:

    # initialize & do all default training
    model = Word2Vec(sentences[:2], min_count=1)
    # now train again even more with a slightly different mix
    model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
    

    Note that in such a case the model's vocabulary is based only on the original initialization. If there are words that appear only in sentences[2:], they will be ignored when those sentences are later presented to the model – it didn't see them during initialization, so they never get vectors. (Using your tiny example corpus this way, the words 'cat2', 'say2' and 'dog2' won't get vectors. Again, you really want to train on the largest corpus – or at least use the largest corpus, with a superset of words, 1st.)
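
    To make that concrete, you can check which words actually received vectors after such a two-stage run. A minimal sketch, assuming the current gensim 4.x API, where the learned vectors live on model.wv:

    from gensim.models import Word2Vec

    sentences = [["cat", "say", "meow"], ["dog", "say", "woof"],
                 ["cat2", "say2", "meow"], ["dog2", "say", "woof"]]

    # vocabulary is fixed by whatever the model saw at initialization
    model = Word2Vec(sentences[:2], min_count=1)
    # later training passes can't add new words
    model.train(sentences, total_examples=len(sentences), epochs=model.epochs)

    print('cat' in model.wv)        # True: seen during initialization
    print('cat2' in model.wv)       # False: only in sentences[2:], so ignored
    print(model.wv.key_to_index)    # the full learned vocabulary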

    Also, a warning will be logged, because the 2nd training will again start the internal alpha learning-rate at its larger starting value, then gradually decrease it to the final min_alpha value. Yo-yo'ing the value like this isn't standard SGD, and usually indicates a user error. But it might be tolerable depending on your goals – you just need to be aware that with unusual training sequences like this you're off in experimental/advanced land and have to deal with possible side-effects via your own understanding.
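
    If you do deliberately continue training, train() also accepts explicit start_alpha / end_alpha arguments, so you can resume from a lower learning rate instead of restarting at the default. A hedged sketch; the 0.01 / 0.0001 values are only illustrative, not recommendations:

    # continue training the already-initialized model, resuming at a lower alpha
    # (values chosen for illustration only)
    model.train(sentences, total_examples=len(sentences), epochs=model.epochs,
                start_alpha=0.01, end_alpha=0.0001)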