'word not in the vocabulary' when evaluating similarity using Gensim Word2Vec.most_similar method

Through the method

gensim.models.Word2Vec.most_similar

I get the top-N most similar words.

I trained a model with a list of sentences like

list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]] 
model = gensim.models.Word2Vec(list_of_list, size=100, window=longest_list, min_count=2)


and I wanted to evaluate phrase similarity.

If for example I run

suggestions = model.most_similar("I don't know what to do", topn=10)

It works correctly.

But if I give a subquery like "to the beach" or "what to do", it returns an error message because the sub-phrase is not in the vocabulary.

 "word 'to the beach' not in vocabulary"

How can I solve this issue without retraining the model? How can the model identify the most similar phrases for a new phrase that is not necessarily a sub-phrase of a training sentence?

Solution

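It seems that you are not training the Word2Vec model correctly. Each sentence should be a list of words, not a list containing a single string. As written, every whole sentence is treated as one token, which is why the full sentence is found in the vocabulary but a sub-phrase such as "to the beach" is not. So, if you change it to: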

list_of_list = [["i like going to the beach"],
                ["the war is over"], 
                ["we are all made of stars"],  
                         ...
                ["i don't know what to do"]]

list_for_training = [sent[0].split() for sent in list_of_list]

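and pass list_for_training as the first argument to the Word2Vec constructor, the model learns one vector per word rather than one per sentence. This does mean retraining the model. (In Gensim 4.0 and later, the size parameter is named vector_size.)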
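To see the difference, here is a minimal sketch that only inspects the tokens Word2Vec would be given in each case. It does not train anything, and the three sentences are a shortened stand-in for the data in the question:

```python
# Shortened stand-in for the training data from the question.
list_of_list = [["i like going to the beach"],
                ["the war is over"],
                ["i don't know what to do"]]

# As written, every "sentence" holds exactly one token: the whole string.
tokens_before = {token for sent in list_of_list for token in sent}

# After splitting, every sentence is a list of individual words.
list_for_training = [sent[0].split() for sent in list_of_list]
tokens_after = {token for sent in list_for_training for token in sent}

print("i don't know what to do" in tokens_before)  # True: the whole sentence is a token
print("to the beach" in tokens_before)             # False: sub-phrases were never seen
print("beach" in tokens_after)                     # True: individual words are tokens now
```

Keep in mind that the real model also applies min_count, so a word that occurs fewer times than that threshold will still be missing from the vocabulary.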

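Similarly, when calling the most_similar method, pass a list of words instead of a single string (in Gensim 4.0 and later, call it as model.wv.most_similar). Every word in the list must be in the vocabulary, so match the casing of the training data, which is lowercase here. Note that the result is a list of similar words, not similar phrases:

suggestions = model.most_similar("I don't know what to do".lower().split(), topn=10)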

suggestions = model.most_similar("to the beach".split(), topn=10)
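A new phrase may still contain words the model has never seen, and those would raise the same "not in vocabulary" error. One possible way to guard against that is to drop unknown words before querying. The sketch below assumes a hypothetical helper, known_words, and uses a hand-written set in place of a real model's vocabulary so that it runs without Gensim:

```python
def known_words(phrase, vocabulary):
    """Lowercase and split a phrase, keeping only words found in the vocabulary."""
    return [word for word in phrase.lower().split() if word in vocabulary]

# Stand-in vocabulary for illustration only. With a trained Gensim 4 model,
# model.wv.key_to_index could be passed here instead (an assumption about
# your Gensim version; older releases expose model.wv.vocab).
vocabulary = {"i", "don't", "know", "what", "to", "do", "the", "beach"}

query = known_words("What to do at the beach", vocabulary)
print(query)  # ['what', 'to', 'do', 'the', 'beach']

# With a real model (not run here), guard against an empty query:
# if query:
#     suggestions = model.wv.most_similar(query, topn=10)
```

Silently dropping words changes the meaning of the query, so it may be worth logging which words were removed.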