python gensim word2vec

Using gensim most_similar function on a subset of total vocab


I am trying to use the gensim word2vec most_similar function in the following way:

wv_from_bin.most_similar(positive=["word_a", "word_b"])

So basically, I have multiple query words and I want to return the most similar outputs, but only from a finite set. That is, if the vocab is 2000 words, I want to return the most similar words from a set of, say, 100 words, and not from all 2000.

e.g.

Vocab:
word_a, word_b, word_c, word_d, word_e ... word_z

Finite set:
word_d, word_e, word_f

most_similar on whole vocab

wv_from_bin.most_similar(positive=["word_a", "word_b"])
output = ['word_d', 'word_f', 'word_g', 'word_x'...]

desired output

finite_set = ['word_d', 'word_e', 'word_f']
wv_from_bin.most_similar(positive=["word_a", "word_b"], finite_set) <-- some way of passing the finite set

output = ['word_d', 'word_f']

Solution

  • Depending on your specific patterns of use, you have a few options.

    If you want to confine your results to a contiguous range of words in the KeyedVectors instance, a few optional parameters can help.

    Most often, people want to confine results to the most frequent words. Those are generally those with the best-trained word-vectors. (When you get deep into less-frequent words, the few training examples tend to make their vectors somewhat more idiosyncratic – both from randomization that's part of the algorithm, and from any ways the limited number of examples don't reflect the word's "true" generalizable sense in the wider world.)

    Using the optional parameter restrict_vocab with an integer value N limits the results to just the first N words in the KeyedVectors (which by usual convention are those that were most frequent in the training data). So, for example, adding restrict_vocab=10000 to a call against a set of vectors with 50000 words will only return the most-similar words from the first 10000 known words. Due to the effect mentioned above, these will often be the most reliable & sensible results, while nearby words from the long tail of low-frequency words are more likely to seem a little out of place.

    Similarly, instead of restrict_vocab, you can use the optional clip_start & clip_end parameters to limit results to any other contiguous range. For example, adding clip_start=100, clip_end=1000 to your most_similar() call will only return results from the 900 words in that range (leaving out the 100 most-common words in the usual case). I suppose that might be useful if you're finding the most-frequent words to be too generic – though I haven't noticed that being a typical problem.

    Based on the way the underlying bulk-vector libraries work, both of the above options efficiently calculate only the needed similarities before sorting out the top-N, using native routines that might achieve nice parallelism without any extra effort.

    If your words are a discontiguous mix throughout the whole KeyedVectors, there's no built-in support for limiting the results.

    Options you could consider include:

    • Especially if you repeatedly search against the exact same subset of words, you could try creating a new KeyedVectors object with just those words - then every most_similar() against that separate set is just what you need. See the constructor & add_vector() or add_vectors() methods in the KeyedVectors docs for how that could be done.

    • Requesting a larger set of results, then filtering your desired subset. For example, if you supply topn=len(wv_from_bin), you'll get back every word, ranked. You could then filter those down to only your desired subset. This does extra work, but that might not be a concern depending on your model size & required throughput. For example:

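    finite_set = {'word_d', 'word_e', 'word_f'}  # set for efficient 'in'
    # topn=len(wv_from_bin) requests every word in the model, ranked by similarity
    all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                              topn=len(wv_from_bin))
    # each result is a (word, similarity) tuple; keep only words in the subset
    filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]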
    
    • You could save a little of the cost of the above by getting all the similarities, unsorted, using the topn=None option. You'd then have to subset those down to your words-of-interest and sort them yourself, and you'd still be paying the cost of the vector-similarity calculations for all words, which with typical large vocabularies is more of the runtime than the sort.

    If you were tempted to iterate over your subset & calculate the similarities 1-by-1, be aware that can't take advantage of the math library's bulk vector operations – which use vector CPU operations on large ranges of the underlying data – so will usually be a lot slower.

    Finally, as an aside: if your vocabulary is truly only ~2000 words, you're far from the bulk of data/words for which word2vec (and dense embedding word-vectors in general) usually shine. You may be disappointed in results unless you get a lot more data. In the meantime, such small vocabs may have problems effectively training typical word2vec dimensionalities (vector_size) of 100, 300, or more. (Using a smaller vector_size, when you have a smaller vocab & less training data, can help a bit.)

    On the other hand, if you're in some domain other than real-language texts with an inherently limited unique vocabulary – like say category-tags or product-names or similar – and you have the chance to train your own word-vectors, you may want to try a wider range of training parameters than the usual defaults. Some recommendation-type apps may benefit from values very different from the ns_exponent default, & if the source data's token-order is arbitrary, rather than meaningful, using a giant window or setting shrink_windows=False will deemphasize immediate-neighbors.