Tags: python, gensim, word2vec, training-data, corpus

Combining/adding vectors from different word2vec models


I am using gensim to create Word2Vec models trained on large text corpora. I have some models based on StackExchange data dumps. I also have a model trained on a corpus derived from English Wikipedia.

Assume that a vocabulary term is in both models, and that the models were created with the same parameters to Word2Vec. Is there any way to combine or add the vectors from the two separate models to create a single new model that has the same word vectors that would have resulted if I had combined both corpora initially and trained on this data?

The reason I want to do this is that I want to be able to generate a model from a specific corpus and then, if I process a new corpus later, add that information to the existing model rather than combining the corpora and retraining everything from scratch (i.e. I want to avoid reprocessing every corpus each time I add information to the model).

Are there built-in functions in gensim or elsewhere that will allow me to combine models like this, adding information to existing models instead of retraining?


Solution

  • Generally, only word vectors that were trained together are meaningfully comparable. (It's the interleaved tug-of-war during training that moves them into meaningful relative orientations, and there's enough randomness in the process that even models trained on the same corpus will vary in where they place individual words.)

    Using words shared by both corpora as guideposts, it is possible to learn a transformation from one space, A, to the other, B, that tries to move those known shared words to their corresponding positions in the other space. Then, applying that same transformation to the words in A that aren't in B, you can find B coordinates for those words, making them comparable to native-B words. (A minimal sketch of this idea appears at the end of this answer.)

    This technique has been used with some success in word2vec-driven language translation (where the guidepost pairs are known translations), and as a means of growing a limited word-vector set with word vectors from elsewhere. Whether it would work well enough for your purposes, I don't know. I imagine it could go astray, especially where the two training corpora use shared tokens in wildly different senses.

    The gensim library includes a TranslationMatrix class that may be able to do this for you (a hedged usage sketch also follows at the end of this answer). See:

    https://radimrehurek.com/gensim/models/translation_matrix.html

    There's a demo notebook of its use at:

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb

    (Whenever practical, doing a full training on a mixed-together corpus, with all word examples, is likely to do better.)
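
Here is a minimal sketch of the guidepost-transformation idea, using plain numpy rather than any gensim helper. It assumes two already-trained sets of word vectors with the same dimensionality; the filenames and the example token are placeholders, and the attribute names assume gensim 4.x KeyedVectors.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder paths: two vector sets trained separately, same dimensionality.
kv_a = KeyedVectors.load("stackexchange_vectors.kv")  # space A
kv_b = KeyedVectors.load("wikipedia_vectors.kv")      # space B

# Guidepost words: vocabulary present in both models.
shared = [w for w in kv_a.index_to_key if w in kv_b.key_to_index]

# One row per shared word, in the same order in both matrices.
A = np.vstack([kv_a[w] for w in shared])
B = np.vstack([kv_b[w] for w in shared])

# Least-squares fit of a matrix W such that A @ W approximates B.
W, *_ = np.linalg.lstsq(A, B, rcond=None)

# Project an A-only word into B's space and look at its nearest
# neighbours among B's native vectors.
word = "some_a_only_token"  # hypothetical token present only in kv_a
projected = kv_a[word] @ W
print(kv_b.similar_by_vector(projected, topn=5))
```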
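
gensim's TranslationMatrix wraps the same fitting step, with the guideposts supplied as word pairs. The call pattern below follows the linked demo notebook, but treat it as an approximation and check the current documentation; the vector files are again placeholders, and here the "pairs" are simply identical words found in both models.

```python
from gensim.models import KeyedVectors
from gensim.models.translation_matrix import TranslationMatrix

kv_a = KeyedVectors.load("stackexchange_vectors.kv")  # placeholder
kv_b = KeyedVectors.load("wikipedia_vectors.kv")      # placeholder

# Guidepost pairs: the same word on both sides, for shared vocabulary.
shared = [w for w in kv_a.index_to_key if w in kv_b.key_to_index]
word_pairs = [(w, w) for w in shared]

trans = TranslationMatrix(kv_a, kv_b, word_pairs=word_pairs)
trans.train(word_pairs)

# For a few source words, show their closest counterparts in B's space.
print(trans.translate(shared[:5], topn=3))
```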
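
For comparison, the full-retraining alternative mentioned above is just a single Word2Vec run over the concatenated corpora. The sentence lists below are toy placeholders for whatever iterables of tokenized sentences you actually have, and the parameters are illustrative (gensim 4.x names).

```python
from gensim.models import Word2Vec

# Placeholder corpora: in practice, iterables of tokenized sentences.
stackexchange_sentences = [["example", "tokens", "from", "stackexchange"]]
wikipedia_sentences = [["example", "tokens", "from", "wikipedia"]]

# Mix the corpora together and train once over everything.
combined = stackexchange_sentences + wikipedia_sentences
model = Word2Vec(sentences=combined, vector_size=100, window=5, min_count=1)
```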