Tags: numpy, matrix, word2vec, algebra, word-embedding

Word Embedding Relations


I want to learn more about the algebraic operations I can perform on word embedding vectors. I know that cosine similarity gives me the most similar word, but I need to do one more level of inference and solve relations of the form:

The relation of X1 to X2 is like the relation of X3 to X4.

As an example, the relation of princess to prince is like the relation of woman to man. I have X1 through X3, and my problem is how to efficiently figure out what X4 can be. I tried cosine similarity against the absolute difference of the vectors, but it is not working.


Solution

  • You can look at exactly how the original Google-released word2vec code solves analogies in word-analogy.c (a toy numpy sketch of the same vector arithmetic appears at the end of this answer):

    https://github.com/tmikolov/word2vec/blob/master/word-analogy.c

    If you're more familiar with Python, you can look at how the gensim Word2Vec implementation tests analogies in its accuracy() method: it reads each analogy "a:b :: c:expected" from the questions-words.txt file (as provided in the original Google word2vec package), then uses b and c as positive (added) examples and a as a negative (subtracted) example, and finally finds the words nearest the resulting vector:

    https://github.com/RaRe-Technologies/gensim/blob/5f630816f8cde46c8408244fb9d3bdf7359ae4c2/gensim/models/keyedvectors.py#L697

    The most_similar() function used there, which accepts multiple positive and negative examples before returning a list of the closest vectors, can be seen at the link below; a short usage sketch follows it:

    https://github.com/RaRe-Technologies/gensim/blob/5f630816f8cde46c8408244fb9d3bdf7359ae4c2/gensim/models/keyedvectors.py#L290
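
    As a short usage sketch, here is how that most_similar() call would solve the analogy from the question in gensim. The file path and the pretrained-vector file are assumptions for illustration; substitute your own trained model:

        from gensim.models import KeyedVectors

        # Assumption: pretrained word vectors in word2vec text format at this path.
        kv = KeyedVectors.load_word2vec_format('vectors.txt')

        # Solve "princess : prince :: woman : X4".
        # Following the gensim convention described above, b ('prince') and
        # c ('woman') are positive (added) examples and a ('princess') is a
        # negative (subtracted) example, so the target vector is roughly
        # vec(prince) - vec(princess) + vec(woman).
        for word, similarity in kv.most_similar(positive=['prince', 'woman'],
                                                negative=['princess'], topn=5):
            print(word, similarity)  # with good vectors, 'man' should rank near the top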
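
    And here is a minimal numpy sketch of the same arithmetic that word-analogy.c performs, assuming a toy vocabulary of unit-normalized vectors (the vocabulary and random values below are made up for illustration; only real trained embeddings would actually rank 'man' on top):

        import numpy as np

        # Assumption: a toy vocabulary of embedding vectors; a real model would
        # load these from a trained word2vec file.
        vocab = {w: np.random.rand(50) for w in ['princess', 'prince', 'woman', 'man']}
        # Unit-normalize so dot products equal cosine similarities.
        vocab = {w: v / np.linalg.norm(v) for w, v in vocab.items()}

        def solve_analogy(a, b, c, vocab, topn=1):
            """Return candidate words X4 such that a : b :: c : X4,
            ranked by cosine similarity to vec(b) - vec(a) + vec(c)."""
            target = vocab[b] - vocab[a] + vocab[c]
            target /= np.linalg.norm(target)
            # Exclude the three input words, as the reference implementations do.
            scores = {w: float(np.dot(v, target))
                      for w, v in vocab.items() if w not in (a, b, c)}
            return sorted(scores.items(), key=lambda item: -item[1])[:topn]

        print(solve_analogy('princess', 'prince', 'woman', vocab))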