I would like to train a word2vec model on what is an unordered list of keywords and categories for each document. As a result, my vocabulary is quite small, around 2.5k tokens.
Would the performance be improved if, at the training step, I used actual sentences from the document?
For example:
```python
doc_keywords = ['beach', 'holiday', 'warm']
doc_body = 'Going on a beach holiday it can be very warm'
```
If there is a benefit to using the full documents, could someone also explain why this is the case?
Since the model learns to predict nearby words in a document, what would be the benefit of it learning very -> warm as two words that often occur together, given that very is not in my vocabulary?
Your dataset seems quite small – perhaps too small to expect good word2vec vectors. But, a small dataset at least means it shouldn't take too much time to try things in many different ways.
So, the best answer (and the only one that truly takes into account whatever uniqueness might be in your data & project goals) comes from trying both: do you get better final word-vectors, for your project-specific needs, when training on just the keywords, or on the longer documents?
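As a minimal sketch of that side-by-side comparison (assuming gensim's Word2Vec with its 4.x API; the toy corpora, hyperparameter values, and probe word below are placeholders for your real data and evaluation):

```python
from gensim.models import Word2Vec

# Toy placeholder corpora: in practice, build these from your real documents.
keyword_docs = [
    ['beach', 'holiday', 'warm'],
    ['ski', 'holiday', 'cold'],
]
fulltext_docs = [
    'going on a beach holiday it can be very warm'.split(),
    'a ski holiday in the mountains can be very cold'.split(),
]

# Identical hyperparameters for both runs, so only the training data differs.
shared = dict(vector_size=100, window=5, min_count=1, epochs=20, seed=42, workers=1)

model_keywords = Word2Vec(sentences=keyword_docs, **shared)
model_fulltext = Word2Vec(sentences=fulltext_docs, **shared)

# Probe both models on whatever matters for your project, e.g. nearest
# neighbours of important keywords, or the score of a downstream task.
print(model_keywords.wv.most_similar('holiday', topn=2))
print(model_fulltext.wv.most_similar('holiday', topn=2))
```

With only a couple of toy documents the printed neighbours are meaningless; the point is just that your own project-specific evaluation should drive the choice.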
Two potential sources of advantage from using the full texts:
1. Those less-interesting words might still help tease out subtleties of meaning in the full vector space. For example, a contrast between 'warm' and 'hot' might become clearer when those words are forced to predict other related words that co-occur with each in different proportions. (But such qualities of word2vec vectors require lots of subtly-varied real usage examples – so such a benefit might not be possible in a small dataset.)
2. Using the real texts preserves the original proximity-influences – words nearer each other have more influence. The keywords-only approach might be scrambling those original proximities, depending on how you're turning raw full texts into your reduced keywords. (In particular, you definitely do not want to always feed keywords in some database-sort order – as that would tend to create a spurious influence between keywords that happen to sort next to each other, as opposed to appearing next to each other in natural language; see the shuffling sketch just after this list.)
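(One hedged way to break any such sort-order artifact, if you do stick with keywords-only training, is to randomize each keyword list before training; a minimal sketch using only Python's standard library, with made-up sample documents:)

```python
import random

def shuffle_keywords(keyword_docs, seed=42):
    """Return each keyword list in a random order, so that a database-sort
    ordering can't create spurious 'neighbours' during training."""
    rng = random.Random(seed)
    return [rng.sample(doc, len(doc)) for doc in keyword_docs]

# Made-up keyword documents, perhaps arriving in alphabetical (sorted) order.
keyword_docs = [['beach', 'holiday', 'warm'], ['cold', 'holiday', 'ski']]
print(shuffle_keywords(keyword_docs))
```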
On the other hand, including more words makes the model larger & the training slower, which might limit the amount of training or experiments you can run. And, keeping very-rare words – those that don't have enough varied usage examples to get good word-vectors themselves – tends to act like 'noise' that dilutes the quality of other word-vectors. (That's why dropping rare words, with a min_count similar to its default of 5 – or larger in larger corpuses – is almost always a good idea.)
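To make that concrete (a toy illustration only; the corpus below is made up, and the API assumed is gensim 4.x), min_count simply trims the vocabulary that survives into training:

```python
from gensim.models import Word2Vec

# Made-up tokenized corpus; substitute your own documents.
corpus = [doc.lower().split() for doc in [
    'Going on a beach holiday it can be very warm',
    'A beach holiday is a warm holiday',
    'Skiing holidays are usually cold',
]]

# Words appearing fewer than min_count times are dropped before training.
for mc in (1, 2, 3):
    model = Word2Vec(corpus, vector_size=50, min_count=mc, epochs=10, seed=42, workers=1)
    print(f'min_count={mc}: {len(model.wv)} words kept')
```

With a vocabulary of only ~2.5k keyword tokens, you may need a min_count well below 5 just to keep a usable vocabulary, at the cost of noisier vectors for the rarest tokens.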
So, there's no sure answer as to which will be better: different factors, and your other data/parameter/goal choices, will pull in different ways. You'll want to try it in multiple ways and compare.