Tags: python, nlp, gensim, similarity

'word not in the vocabulary' when evaluating similarity using Gensim Word2Vec.most_similar method


Through the method

gensim.models.Word2Vec.most_similar

I get the top-N most similar words.

I trained a model with a list of sentences like

list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]] 
model = gensim.models.Word2Vec(list_of_list, size=100, window=longest_list, min_count=2)

suggestions = model.most_similar("I don't know what to do", topn=10)       

and I wanted to evaluate phrase similarity.

If for example I run

suggestions = model.most_similar("I don't know what to do", topn=10)       

It works correctly.

But if I give a subquery like "to the beach" or "what to do", it returns an error message because the sub-phrase is not in the vocabulary.

 "word 'to the beach' not in vocabulary"

How can I solve this issue without retraining the model? How can the model identify the most similar phrases for a new phrase, not necessarily a sub-phrase of a training sentence?


Solution

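  • It seems that you are not training the Word2Vec model correctly. Each sentence should be a list of words, not a list containing a single string; otherwise the whole sentence is treated as one vocabulary entry, which is why only complete training sentences are found. So, if you change it to: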

    list_of_list = [["i like going to the beach"],
                    ["the war is over"], 
                    ["we are all made of stars"],  
                             ...
                    ["i don't know what to do"]]
    
    list_for_training = [sent[0].split() for sent in list_of_list]
    

    and use list_for_training as the first parameter of the constructor of Word2Vec.

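    Similarly, when calling most_similar, pass a list of words rather than a single string, tokenized the same way as the training data (lowercased here, since the training sentences are lowercase). The method then returns the words closest to the mean of those words' vectors:

    suggestions = model.most_similar("I don't know what to do".lower().split(), topn=10)  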
    

    or

    suggestions = model.most_similar("to the beach".split(), topn=10)
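
To see why the sub-phrase lookup fails, here is a minimal sketch in plain Python that mimics how a word-level vocabulary is collected from the training data. The helper build_vocab is hypothetical and only illustrative; it is not part of gensim, and the elided sentences from the question are left out.

```python
from collections import Counter

# Same shape as the question's data: each sentence is a one-element list.
list_of_list = [["i like going to the beach"],
                ["the war is over"],
                ["we are all made of stars"],
                ["i don't know what to do"]]

def build_vocab(sentences, min_count=1):
    """Collect tokens seen at least min_count times (illustrative helper, not gensim)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Untokenized: every whole sentence becomes a single "word".
vocab_raw = build_vocab(list_of_list)

# Tokenized: split each sentence string into words first.
list_for_training = [sent[0].split() for sent in list_of_list]
vocab_split = build_vocab(list_for_training)

print("to the beach" in vocab_raw)               # False: the sub-phrase was never a token
print("i don't know what to do" in vocab_raw)    # True: the full sentence is one token

query = "to the beach".lower().split()
print(all(word in vocab_split for word in query))  # True: each word is a known token
```

Note that every word in the query still has to be in the vocabulary, so a high min_count on a small corpus can drop words and bring the error back. If you are on gensim 4.x, the constructor argument is vector_size rather than size, and most_similar is called on model.wv rather than on the model itself.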