I have started learning gensim for both word2vec and doc2vec, and it works; the similarity scores actually work really well. For an experiment, however, I want to improve a keyword-based search algorithm by comparing a single word to a piece of text and measuring how similar they are.
What is the best way to do this? I considered averaging the word vectors of all words in the text (perhaps after removing filler and stop words first) and comparing that average to the search word. But this is really just intuition; what would be the best way to do this?
Averaging all the word-vectors of a longer text is one crude but somewhat effective way to get a single vector for the full text. The resulting vector might then be usefully comparable to single word-vectors.
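A minimal sketch of that averaging approach, assuming a trained gensim Word2Vec model; the tiny corpus, parameter values, and query word are just illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus for illustration; in practice you'd train on real text
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["a", "fast", "fox", "leaped", "over", "a", "sleepy", "dog"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

def text_vector(tokens, kv):
    """Average the word-vectors of all in-vocabulary tokens."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "fox"                                    # single search word
doc = ["the", "quick", "brown", "fox", "jumps"]  # tokenized text
print(cosine(model.wv[query], text_vector(doc, model.wv)))
```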
The Doc2Vec modes that train word-vectors into the same 'space' as the doc-vectors, namely PV-DM (dm=1) or PV-DBOW with word-training added (dm=0, dbow_words=1), could also be considered. The doc-vectors closest to a single word-vector might then work for your purposes.
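As a rough illustration of that idea (not a definitive recipe: the tiny corpus, parameter values, and the query word "fox" are all assumptions), a PV-DM model lets you ask for the doc-vectors closest to a word-vector:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus, purely illustrative
texts = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["stock", "markets", "fell", "sharply", "after", "the", "announcement"],
]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

# dm=1 (PV-DM) trains word-vectors and doc-vectors into the same space
model = Doc2Vec(corpus, dm=1, vector_size=50, min_count=1, epochs=100)

# Which documents sit closest to the word-vector for 'fox'?
print(model.dv.most_similar([model.wv["fox"]], topn=2))
```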
Another technique for calculating the 'closeness' of two sets-of-word-vectors is "Word Mover's Distance" (WMD). It's more expensive to calculate than techniques that reduce a text to a single vector, because it essentially considers many possible cost-minimizing ways of matching up the two sets of vectors. I'm not sure how well it works in the degenerate case where one 'text' is just a single word (or a very short phrase), but it could be worth trying. (The wmdistance() method on gensim's word-vectors offers this.)
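A quick sketch of WMD against a single-word query, assuming a recent gensim where the method lives on the word-vectors as wmdistance() (it needs an optimal-transport backend such as pyemd or POT installed, depending on the gensim version); the toy corpus is again made up:

```python
from gensim.models import Word2Vec

sentences = [
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "president", "greets", "the", "press", "in", "chicago"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

query = ["president"]   # the degenerate one-word 'text'
doc = sentences[1]
# Lower distance means the two texts are closer in word-mover terms
print(model.wv.wmdistance(query, doc))
```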
I've also seen mention of another calculation, called 'Soft Cosine Similarity', that may be more efficient than WMD while offering similar benefits. It's also now available in gensim; there's a Jupyter notebook intro tutorial as well.
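For completeness, a soft-cosine sketch using gensim's similarity classes; the corpus is made up and the exact class names assume a reasonably recent gensim release:

```python
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import (
    SoftCosineSimilarity,
    SparseTermSimilarityMatrix,
    WordEmbeddingSimilarityIndex,
)

texts = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["stock", "markets", "fell", "sharply", "after", "the", "announcement"],
]
w2v = Word2Vec(texts, vector_size=50, min_count=1, epochs=50)

dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Term-similarity matrix built from the word-embeddings
termsim_index = WordEmbeddingSimilarityIndex(w2v.wv)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

index = SoftCosineSimilarity(bow_corpus, termsim_matrix)
query_bow = dictionary.doc2bow(["fox"])  # single search word
print(index[query_bow])                  # soft-cosine similarity to each text
```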