I want to learn more about the algebraic operations I can perform on word embedding vectors. I know that with cosine similarity I can find the most similar word, but I need to do one more level of inference and solve relations of the form:
The relation of X1 to X2 is like the relation of X3 to X4.
As an example, the relation of princess to prince is like the relation of woman to man. I have X1, X2 and X3, and my problem is how to efficiently figure out what X4 can be. I tried cosine similarity against the absolute difference of the vectors, but it is not working.
You can look at exactly how the original Google-released word2vec code solves analogies in its word-analogy.c file:
https://github.com/tmikolov/word2vec/blob/master/word-analogy.c
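In essence, that code answers "X1 is to X2 as X3 is to ?" by computing the vector v(X2) - v(X1) + v(X3), then returning the vocabulary word whose vector is closest to it by cosine similarity, skipping the three input words. A minimal sketch of that arithmetic in Python, assuming you already have a dict of unit-normalized numpy vectors (the names here are illustrative, not taken from the C code):

    import numpy as np

    def solve_analogy(vectors, x1, x2, x3):
        """Return the best X4 for "X1 : X2 :: X3 : X4"."""
        # vectors: dict mapping word -> unit-length numpy array (assumed prebuilt)
        target = vectors[x2] - vectors[x1] + vectors[x3]
        target /= np.linalg.norm(target)           # renormalize for cosine comparison
        best_word, best_score = None, -1.0
        for word, vec in vectors.items():
            if word in (x1, x2, x3):               # skip the input words themselves
                continue
            score = float(np.dot(vec, target))     # cosine = dot product of unit vectors
            if score > best_score:
                best_word, best_score = word, score
        return best_word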
If you're more familiar with Python, you can look at how the gensim Word2Vec implementation tests analogies, in its accuracy() method: it reads each analogy "a:b :: c:expected" from the questions-words.txt file (as provided in the original Google word2vec package), then uses b and c as positive (added) examples and a as a negative (subtracted) example to find words near the resulting vector.
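For example, with a set of trained vectors loaded into gensim (the file path here is hypothetical), the asker's own analogy can be solved directly:

    from gensim.models import KeyedVectors

    # Load pretrained vectors; the path and format are just examples
    wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

    # "princess is to prince as woman is to ?"
    # a = princess (negative); b = prince and c = woman (positive)
    print(wv.most_similar(positive=['prince', 'woman'], negative=['princess'], topn=3))
    # 'man' should rank at or near the top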
The most_similar() function used there, which accepts multiple positive and negative examples before returning a list of the closest vectors, can be seen in the gensim source.
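Roughly, most_similar() combines the examples into a single query vector: each positive word contributes its unit vector with weight +1.0 and each negative word with weight -1.0, the sum is renormalized, and all vocabulary words are ranked by cosine similarity to it, with the input words excluded. A simplified sketch of that logic (not the exact gensim implementation):

    import numpy as np

    def most_similar_sketch(vectors, positive, negative, topn=10):
        # vectors: dict mapping word -> unit-length numpy array (assumed)
        # Weighted combination: +1.0 for positives, -1.0 for negatives
        query = (np.sum([vectors[w] for w in positive], axis=0)
                 - np.sum([vectors[w] for w in negative], axis=0))
        query /= np.linalg.norm(query)
        seen = set(positive) | set(negative)
        scores = [(float(np.dot(vec, query)), word)
                  for word, vec in vectors.items() if word not in seen]
        return sorted(scores, reverse=True)[:topn]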