python-3.x, nlp, word2vec, word-embedding

How does online training work in the Word2vec model using Gensim?


Using the Gensim library, we can load a saved model and update its vocabulary when new sentences are added. That means if you save the model, you can continue training it later. I checked this with sample data. Say I have a word in my vocabulary that was previously trained (e.g. "women"). Then I get new sentences, call model.build_vocab(new_sentences, update=True) followed by model.train(new_sentences, ...), and the model is updated. My new_sentences contain a word that already exists in the previous vocabulary ("women") and a new word that does not ("girl"). After updating the vocabulary, I have both the old and the new words in the model. Checking model.wv['women'], I see that its vector changed after training on the new sentences, and I can also get an embedding vector for the new word with model.wv['girl']. All other words that were previously trained but do not appear in new_sentences keep their old vectors unchanged.

from gensim.models import Word2Vec

model = Word2Vec(old_sentences, vector_size=100, window=5, min_count=1)
model.save("word2vec.model")

model = Word2Vec.load("word2vec.model")  # load the previously saved model
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)
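
To make the check above concrete, here is a minimal sketch with hypothetical toy corpora (the contents of old_sentences and new_sentences below are placeholders, not my real data); it assumes Gensim 4.x and sets sample=0 so that every word in such a tiny corpus actually receives training updates:

import numpy as np
from gensim.models import Word2Vec

# Hypothetical toy corpora standing in for the real data.
old_sentences = [["the", "women", "walked"], ["the", "men", "talked"]]
new_sentences = [["the", "girl", "and", "the", "women", "smiled"]]

# sample=0 disables frequency downsampling, which would otherwise skip most
# words in a corpus this small.
model = Word2Vec(old_sentences, vector_size=100, window=5, min_count=1, sample=0)

before_women = model.wv["women"].copy()  # word that appears in new_sentences
before_men = model.wv["men"].copy()      # word that does not

model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

print(np.array_equal(before_men, model.wv["men"]))      # expected True: untouched word unchanged
print(np.array_equal(before_women, model.wv["women"]))  # expected False: vector was updated
print(model.wv["girl"].shape)                           # the new word now has a 100-d vector

(This sketch passes total_examples=len(new_sentences) for the new batch rather than the cached model.corpus_count as in my code above; the answer below discusses that choice.)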

However, I just don't understand in depth how the online training works internally. Please let me know if anybody knows the details. I understand the code, but I want to understand how online training works theoretically. Does it re-train the model on the old and new training data from scratch?

Here is the link that I followed: Online training


Solution

  • When you perform a new call to .train(), it only trains on the new data. So only words in the new data can possibly be updated.

    And to the extent that the new data may be smaller, and more idiosyncratic in its word usages, any words in the new data will be trained to be consistent only with the other words being trained in the new data. (Depending on the size of the new data, and the training parameters chosen like alpha & epochs, they might be pulled via the new examples arbitrarily far from their old locations - and thus start to lose comparability to the words that were trained earlier.)

    (Note also that when providing a different corpus than the original, you shouldn't reuse a parameter like total_examples=model.corpus_count, a value cached in the model from the earlier data. Rather, the parameters should describe the current batch of data - see the sketch at the end of this answer.)

    Frankly, I'm not a fan of this feature. It's possible it could be useful to advanced users. But most people drawn to it are likely misusing it, expecting any number of tiny incremental updates to constantly expand & improve the model - when there's no good support for the idea that this will reliably happen with naive use.

    In fact, there are reasons to doubt such updates are generally a good idea. There's even an established term for the risk that incremental updates to a neural network wreck its prior performance: catastrophic forgetting.

    The straightforward & best-grounded approach to updating word-vectors for new, expanded data is to re-train from scratch, so all words are on equal footing and go through the same interleaved training on the same unified optimization (SGD) schedule. (The new vectors at the end of such a process will not be in a coordinate space compatible with the old ones, but they should be equivalently useful, or better if the data is now bigger and better.)
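
    As a hedged illustration of both points (the corpora below are hypothetical stand-ins, not data from the question), the first block shows an incremental update whose total_examples/epochs describe only the new batch, and the second shows the recommended full re-train on the combined data:

    from gensim.models import Word2Vec

    # Hypothetical corpora standing in for the asker's data.
    old_sentences = [["the", "women", "walked"], ["the", "men", "talked"]]
    new_sentences = [["the", "girl", "and", "the", "women", "smiled"]]

    # Option A: incremental update. total_examples & epochs describe the *new*
    # batch, not the model.corpus_count cached from the original training run.
    model = Word2Vec(old_sentences, vector_size=100, window=5, min_count=1)
    model.build_vocab(new_sentences, update=True)
    model.train(new_sentences,
                total_examples=len(new_sentences),  # size of this batch only
                epochs=model.epochs)

    # Option B (recommended): re-train from scratch on the combined corpus, so
    # all words share the same interleaved training & SGD schedule.
    combined = old_sentences + new_sentences
    fresh_model = Word2Vec(combined, vector_size=100, window=5, min_count=1)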