Tags: python, python-3.x, nlp, gensim, word2vec

Understanding gensim word2vec's most_similar


I am unsure how I should use the most_similar method of gensim's Word2Vec. Let's say you want to test the tried-and-true example: man stands to king as woman stands to X; find X. I thought that was what you could do with this method, but from the results I am getting I don't think that is true.

The documentation reads:

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

I assume, then, that most_similar takes the positive examples and negative examples, and tries to find points in the vector space that are as close as possible to the positive vectors and as far away as possible from the negative ones. Is that correct?

Additionally, is there a method that allows us to map the relation between two points to another point and get the result (cf. the man-king woman-X example)?


Solution

  • You can view exactly what most_similar() does in its source code:

    https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L485

    It's not quite "find points in the vector space that are as close as possible to the positive vectors and as far away as possible from the negative ones". Rather, as described in the original word2vec papers, it performs vector arithmetic: adding the positive vectors, subtracting the negative ones, and then ranking the known word-vectors by their closeness to the resulting direction.

    That is sufficient to solve man : king :: woman : ?-style analogies, via a call like:

    sims = wordvecs.most_similar(positive=['king', 'woman'], 
                                 negative=['man'])
    

    (You can think of this as: start at the 'king' vector, add the 'woman' vector, subtract the 'man' vector, and from wherever you wind up, report the ranked word-vectors closest to that point, leaving out any of the 3 query words.)
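    To make the mechanism concrete, here is a rough numpy-only sketch of that logic, using tiny invented 2D "embeddings" (axis 0 loosely meaning gender, axis 1 royalty). The toy vectors and the helper names (`unit`, `toy_most_similar`) are illustrative assumptions, not gensim's actual implementation, which also unit-normalizes and averages the query vectors before ranking:

    ```python
    import numpy as np

    # Toy 2D vectors, invented for illustration -- not trained embeddings.
    vectors = {
        "man":   np.array([ 1.0,  0.0]),
        "woman": np.array([-1.0,  0.0]),
        "king":  np.array([ 1.0,  1.0]),
        "queen": np.array([-1.0,  1.0]),
        "apple": np.array([ 0.0, -1.0]),
    }

    def unit(v):
        # Scale a vector to unit length so dot products become cosines.
        return v / np.linalg.norm(v)

    def toy_most_similar(positive, negative, topn=1):
        """Sketch of the most_similar() idea: average the unit vectors
        (positives with weight +1, negatives with -1), then rank all
        remaining words by cosine similarity to that mean direction,
        excluding the query words themselves."""
        query = set(positive) | set(negative)
        mean = np.mean([unit(vectors[w]) for w in positive]
                       + [-unit(vectors[w]) for w in negative], axis=0)
        mean = unit(mean)
        sims = {w: float(np.dot(unit(v), mean))
                for w, v in vectors.items() if w not in query}
        return sorted(sims.items(), key=lambda kv: -kv[1])[:topn]

    print(toy_most_similar(positive=["king", "woman"], negative=["man"]))
    ```

    With these toy vectors, "queen" comes out on top: adding 'woman' and subtracting 'man' flips the gender component of 'king' while keeping its royalty component, which is exactly the analogy arithmetic the answer describes.
    
    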