Tags: python, similarity, word2vec, gensim, word-embedding

Word2Vec Python similarity


I made a word embedding with this code:

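from gensim.models import Word2Vec

sentences = []
with open("text.txt", "r") as longFile:
    for line in longFile:
        # Build a fresh token list per line; reusing one list would make
        # every "sentence" the same ever-growing list.
        single = line.split()  # split() with no argument also drops the newline
        sentences.append(single)
model = Word2Vec(sentences, workers=4, window=5)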

I now want to calculate the similarity between two words and see what their neighbours are. What is the difference between model["word"], model.wv.most_similar(), model.similar_by_vector() and model.similarity()? Which one should I use?


Solution

  • Edit: Maybe we should tag gensim here, because it is the library we are using

    If you want to find the neighbours of each word, you can use model.wv.most_similar(). For a given string (word), it returns a list of the top n (word, similarity) pairs. The method computes the cosine similarity between the word vectors.
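
    To make the "top n by cosine similarity" idea concrete, here is a minimal pure-Python sketch. It is not gensim's implementation; the names nearest_neighbours and toy_vectors, and the three-dimensional vectors, are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(word, vectors, topn=2):
    """Hypothetical helper: rank every other word by cosine similarity."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word  # do not report the query word itself
    ]
    # Highest similarity first, then keep only the top n pairs.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vectors (assumed values, not taken from a trained model).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

neighbours = nearest_neighbours("king", toy_vectors, topn=2)
for other, score in neighbours:
    print(other, round(score, 3))
```

    The result has the same shape as what most_similar() gives you: a list of (word, similarity) pairs, best match first.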

    Note that the other methods you mentioned, called directly on the model, are deprecated in gensim 3.4.0; use model.wv["word"], model.wv.similarity() and model.wv.similar_by_vector() instead.

    You can also use model.wv.similar_by_vector() to do exactly the same thing by passing a vector instead of a word; e.g. model.wv["woman"] gives you such a vector. If you look at the implementation, all the method does is call most_similar():

    def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
       return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)
    

    Same goes for the similar_by_word() method. I actually don't know why these methods exist in the first place.

    To find a similarity measure between exactly two words, you can use either model.wv.similarity() for the cosine similarity or model.wv.distance() for the cosine distance (1 minus the similarity).
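
    The relationship between the two measures can be sketched in a few lines of plain Python. Again, this only illustrates the arithmetic with made-up two-dimensional vectors; the function names are hypothetical and not part of gensim.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions

print(same_direction, cosine_distance([3.0, 4.0], [6.0, 8.0]))
print(orthogonal, cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

    So a high similarity means a small distance and vice versa; which one you report is a matter of convenience.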

    To answer your actual question: I would simply compute the similarity between the two words directly instead of comparing the results of most_similar().

    I hope this helps. For more information, look at the docs or the source files; I think the code documentation is pretty good.