I'm new to NLP. I'm currently building an NLP system in a specific domain. After training word2vec and fasttext models on my documents, I found that the embeddings are not very good because I didn't train on enough documents (e.g., the embeddings can't capture that "bar" and "pub" are strongly related to each other, because "pub" appears only a few times in the documents). Later, I found a word2vec model online that was built on that domain-specific corpus and definitely has much better embeddings (so "pub" is closely related to "bar"). Is there any way to improve my word embeddings using the model I found? Thanks!
Word2Vec (and similar) models really require a large volume of varied data to create strong vectors.
But also, a model's vectors are typically only meaningful alongside other vectors that were trained together in the same session. This is both because the process includes some randomness, and because the vectors only acquire their useful positions via a tug-of-war with all the other vectors and aspects of the model-in-training.
So, there's no standard location for a word like 'bar' – just a good position, within a certain model, given the training data, the model parameters, and the other words co-populating the model.
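As a quick illustration of that point (a minimal sketch using gensim, with a toy corpus standing in for real data), two training runs on the exact same text will put the same word at different raw coordinates, even though each run's internal neighborhoods come out similar:

```python
from gensim.models import Word2Vec

# toy corpus - just a stand-in for real training data
sentences = [
    ["the", "pub", "served", "cold", "beer"],
    ["the", "bar", "served", "cold", "beer"],
    ["we", "met", "at", "the", "pub", "after", "work"],
    ["we", "met", "at", "the", "bar", "after", "work"],
] * 50

# two runs on identical data, differing only in random seed
# (parameter names assume gensim 4.x: vector_size/epochs)
m1 = Word2Vec(sentences, vector_size=50, min_count=1, seed=1, workers=1, epochs=20)
m2 = Word2Vec(sentences, vector_size=50, min_count=1, seed=2, workers=1, epochs=20)

# the raw coordinates for 'pub' differ between the two models...
print(m1.wv["pub"][:5])
print(m2.wv["pub"][:5])

# ...but *within* each model, 'bar' still turns up as a close neighbor
print(m1.wv.most_similar("pub", topn=3))
print(m2.wv.most_similar("pub", topn=3))
```

Comparing m1's 'pub' vector directly against m2's 'bar' vector would be meaningless; only comparisons within a single model are.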
This means mixing vectors from different models is non-trivial. There are ways to learn a 'translation' that moves vectors from the space of one model to another – but that is itself a lot like a re-training. You can pre-initialize a model with vectors from elsewhere... but as soon as training starts, all the words in your training corpus will start drifting into the best alignment for that data, and gradually away from their original positions, and away from pure comparability with other words that aren't being updated.
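If you do want to try the pre-initialization route anyway, a rough sketch (assuming a recent gensim, with the downloaded model in the standard word2vec format; the file name and tiny corpus are placeholders) is to build the vocabulary from your own corpus, copy in the downloaded vectors for whatever words overlap, and only then train:

```python
from gensim.models import Word2Vec, KeyedVectors

# placeholder path: the better model you found online, in word2vec binary format
pretrained = KeyedVectors.load_word2vec_format("domain_vectors.bin", binary=True)

# your own tokenized sentences (placeholder examples)
my_corpus = [
    ["we", "met", "at", "the", "pub", "after", "work"],
    ["the", "bar", "served", "cold", "beer"],
    # ... the rest of your documents
]

model = Word2Vec(vector_size=pretrained.vector_size, min_count=1)
model.build_vocab(my_corpus)

# seed each word that exists in both vocabularies with its pretrained vector
shared = [w for w in model.wv.key_to_index if w in pretrained]
for w in shared:
    model.wv.vectors[model.wv.key_to_index[w]] = pretrained[w]
print(f"pre-initialized {len(shared)} of {len(model.wv)} words")

# training then pulls these words toward *your* data, and gradually away from
# comparability with pretrained words that never appear in your corpus
model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```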
In my opinion, the best approach is usually to expand your corpus with more appropriate data, so that it has "enough" examples of every word important to you, in sufficiently varied contexts.
Many people use large free text dumps, like Wikipedia articles, for word-vector training, but be aware that their style of writing – dry, authoritative reference text – may not be optimal for all domains. If your problem-area is "business reviews", you'd probably do best finding other review texts. If it's fiction stories, more fictional writing. And so forth. You can shuffle these other text sources in with your data to expand the vocabulary coverage.
You can also potentially shuffle in extra repeated examples of your own local data, if you want it to have relatively more influence. (Generally, merely repeating a small number of non-varied examples can't help improve word-vectors: it's the subtle contrasts between different examples that help. But as a way to incrementally boost the influence of some examples, when there are plenty of examples overall, it can make more sense.)
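As a rough sketch of both ideas (the file names and the repeat factor of 3 are placeholders, and everything is loaded into memory for simplicity; for large corpora you'd stream instead), you can concatenate the extra text sources with a few repeated copies of your own data, shuffle, and train:

```python
import random
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def read_sentences(path):
    """Yield tokenized sentences from a one-sentence-per-line text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = simple_preprocess(line)
            if tokens:
                yield tokens

# placeholder file names - substitute your real sources
external = list(read_sentences("related_domain_text.txt"))  # large, similar-style corpus
local = list(read_sentences("my_documents.txt"))            # your own smaller corpus

# repeat your own data a few times so it carries relatively more influence,
# then shuffle so examples from the two sources are well interleaved
combined = external + local * 3
random.shuffle(combined)

model = Word2Vec(combined, vector_size=100, window=5, min_count=5, epochs=5)

# sanity check (assuming 'pub' survives the min_count cutoff)
print(model.wv.most_similar("pub", topn=5))
```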