Through the method gensim.models.Word2Vec.most_similar I get the top-N most similar words.
I trained a model with a list of sentences like
list_of_list = [["i like going to the beach"],
["the war is over"],
["we are all made of stars"],
...
["i don't know what to do"]]
model = gensim.models.Word2Vec(list_of_list, size=100, window=longest_list, min_count=2)
suggestions = model.most_similar("I don't know what to do", topn=10)
and I want to evaluate phrase similarity.
If, for example, I run
suggestions = model.most_similar("I don't know what to do", topn=10)
it works correctly.
But if I give a subquery like "to the beach" or "what to do", it returns an error message because the sub-phrase is not in the vocabulary:
"word 'to the beach' not in vocabulary"
How can I solve this issue without retraining the model? How can the model identify the most similar phrases for a new phrase that is not necessarily a sub-phrase of a training sentence?
It seems that you are not training the Word2Vec model correctly. Each sentence should be a list of words, not a list containing a single string. So change it to:
list_of_list = [["i like going to the beach"],
["the war is over"],
["we are all made of stars"],
...
["i don't know what to do"]]
list_for_training = [sent[0].split() for sent in list_of_list]
and use list_for_training as the first argument to the Word2Vec constructor.
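To see why the original input shape fails, here is a minimal sketch in plain Python (no gensim required) of the vocabulary each shape produces. Word2Vec treats every element of a sentence as one token, so a one-element list turns the whole sentence into a single "word". The helper build_vocab is hypothetical and only imitates the min_count filter; it is not part of gensim. The corpus contains just the four sentences shown in the question, since the rest is elided.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Count every token across all sentences; keep tokens seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

list_of_list = [["i like going to the beach"],
                ["the war is over"],
                ["we are all made of stars"],
                ["i don't know what to do"]]

# Original shape: each sentence is a list holding one long string,
# so the only tokens are the full sentences themselves.
vocab_untokenized = build_vocab(list_of_list)

# Fixed shape: each sentence is a list of words.
list_for_training = [sent[0].split() for sent in list_of_list]
vocab_tokenized = build_vocab(list_for_training)

print("to the beach" in vocab_untokenized)             # False
print("i don't know what to do" in vocab_untokenized)  # True
print("beach" in vocab_tokenized)                      # True
```

Note that with min_count=2, as in the question, only words occurring at least twice survive (here just "i", "the" and "to"), so on a small corpus rare words would still trigger the "not in vocabulary" error.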
Similarly, when calling the most_similar method, provide a list of words instead of a single string:
suggestions = model.most_similar("I don't know what to do".lower().split(), topn=10)
or
suggestions = model.most_similar("to the beach".split(), topn=10)
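Keep in mind that, given a list of words, most_similar ranks vocabulary words against the mean of the query word vectors, so the result is a list of similar words rather than similar phrases. The following is a rough sketch of that idea in plain Python with made-up 2-dimensional vectors. The names toy_vectors and most_similar_words are hypothetical, silently skipping unknown words is a choice made for this sketch rather than gensim's behavior, and the library's actual implementation differs in detail.

```python
import math

# Made-up 2-d vectors for illustration only; a trained model would supply real ones.
toy_vectors = {
    "beach": [0.9, 0.1],
    "sea":   [0.8, 0.3],
    "war":   [-0.7, 0.6],
    "to":    [0.1, 0.9],
    "the":   [0.2, 0.8],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_words(words, vectors, topn=3):
    """Rank vocabulary words by cosine similarity to the mean of the query words."""
    # Assumption of this sketch: drop query words that have no vector.
    known = [w for w in words if w in vectors]
    if not known:
        raise KeyError("none of the query words are in the vocabulary")
    # Average the unit vectors of the query words to get one query vector.
    units = [unit(vectors[w]) for w in known]
    dim = len(units[0])
    query = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Score every other word by cosine similarity, excluding the query words.
    scores = [(w, sum(a * b for a, b in zip(query, unit(v))))
              for w, v in vectors.items() if w not in known]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

suggestions = most_similar_words("to the beach".split(), toy_vectors, topn=2)
print(suggestions)
```

With these toy vectors, "sea" ranks above "war" for the query "to the beach". To compare whole phrases with each other, one option along the same lines would be to build such a mean vector for each phrase and compare those vectors directly.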