I've got a question about gensim Word2Vec, and the documentation doesn't help me.
For example, in my block of text I have some sentences like:
<Word1> <Word2> <Word3>
<Word1> <Word2> <Word3>
<Word1> <Word2> <Word3>
...
And at some point I get a new sentence like:
<Word1> <Word2> <Word3> <Word4>
How can I detect this situation? (Of course, Word4 is in the vocabulary too.)
My solutions: 1) I tried to find the most similar words for each word and check whether the next word of the sentence is in that list: if it is, everything is OK; otherwise I can detect Word4. I mean I will do:
model.most_similar('<Word_i>')
or
model.similar_by_vector('<Word_i>')
and expect Word_i+1 to be near the top of the result list. But it doesn't work! I thought that after training, the words of a sentence would end up with quite similar vectors, so that Word_i+1 would be in the top list for Word_i, but that's wrong. When I checked this solution after training on the whole corpus of text, Word_2 wasn't even in the top list for Word_1! My explanation is that it's not the neighbouring words that get similar vectors, but the words with similar contexts, which is not the same thing.
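To show what I mean, here is roughly the check I ran (just a sketch: model is my Word2Vec model already trained on my corpus, I go through model.wv because that's where my gensim version keeps the vectors, and topn=10 is an arbitrary cut-off):

from gensim.models import Word2Vec

# model = Word2Vec(my_sentences, ...)  # trained earlier on my whole corpus

def next_word_in_top(model, sentence, topn=10):
    # For every consecutive pair in the sentence, check whether the next
    # word appears among the topn most similar words of the current one.
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        top = [w for w, _ in model.wv.most_similar(current, topn=topn)]
        print(current, '->', following, 'in top list:', following in top)

next_word_in_top(model, '<Word1> <Word2> <Word3>')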
2) So my second solution is to use doesnt_match(), which takes a list of words and reports the one word that is furthest from the average of all the words.
print(model.doesnt_match('<Word1> <Word2> <Word3> <Word4>'.split()))
And yes, in this case the answer will be Word4 (so I detect this word)! But if I do the same with:
print(model.doesnt_match('<Word1> <Word2> <Word3>'.split()))
The answer will be Word2 (for example). And if I then explore the top similar words for Word1 and Word3, I won't see Word2 in those lists either, even though this sentence (Word1 Word2 Word3) is normal.
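Here is roughly how I checked that (again a sketch, with the same assumptions as above: model is my already-trained model and topn=10 is arbitrary):

def check_sentence(model, sentence, topn=10):
    words = sentence.split()
    odd_one = model.wv.doesnt_match(words)  # word furthest from the mean of the others
    others = [w for w in words if w != odd_one]
    # Does the "odd" word at least show up among the nearest neighbours
    # of the remaining words?
    for w in others:
        top = [sim for sim, _ in model.wv.most_similar(w, topn=topn)]
        print(w, ': odd word', odd_one, 'in its top list:', odd_one in top)

check_sentence(model, '<Word1> <Word2> <Word3> <Word4>')  # reports Word4
check_sentence(model, '<Word1> <Word2> <Word3>')          # still reports something, e.g. Word2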
So how can I detect it?
I'm not sure I fully understand the question, but I'll try to explain the word2vec concept and what most_similar returns, and hopefully that will be helpful.
So, let's consider a situation where there are two sentences: <Word1> <Word2> <Word3> and <Word1> <Word4> <Word3>. When building the word2vec model, we take the same number of words to the left and right of the target (current) word and construct tuples of the form (target_word, proximity_word). Let's say we look at the case where the target word is the middle word. For sentence 1 we get (<Word2>, <Word1>) and (<Word2>, <Word3>), and for sentence 2 we get (<Word4>, <Word1>) and (<Word4>, <Word3>). This way we tell the model that <Word1> and <Word3> appear in the context of <Word2>. Similarly, <Word1> and <Word3> appear in the context of <Word4>. What does that mean? We can conclude that <Word2> and <Word4> are in some way similar.
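To make the pair construction concrete, here is a small sketch of how those (target_word, proximity_word) tuples come out with one word of context on each side (plain Python just for illustration, not gensim's actual implementation):

def context_pairs(sentence, window=1):
    # Yield (target_word, proximity_word) tuples for every word in the sentence.
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, sentence[j])

print(list(context_pairs('<Word1> <Word2> <Word3>'.split())))
# includes ('<Word2>', '<Word1>') and ('<Word2>', '<Word3>')
print(list(context_pairs('<Word1> <Word4> <Word3>'.split())))
# includes ('<Word4>', '<Word1>') and ('<Word4>', '<Word3>')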
So if you call most_similar('<Word2>') you will not get <Word1> or <Word3> but <Word4>, because Word2 and Word4 appear in the same context. That said, if you have the sentence <Word1> <Word2> <Word3> <Word4> and call most_similar('<Word3>'), you cannot expect to get the vector of <Word4>. Instead you'll get some word that has appeared in the contexts of words 1, 2 and 4 (how wide this context window is depends on the size we specify before training). I hope this has been helpful and makes word2vec clearer.
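If you want to see this on a toy example, you can try something like the code below (assuming gensim 4.x keyword names such as vector_size and epochs; on such a tiny artificial corpus the rankings are noisy, so treat it as a schematic rather than a guaranteed result):

from gensim.models import Word2Vec

# Many copies of the two sentence shapes discussed above.
sentences = [['<Word1>', '<Word2>', '<Word3>'],
             ['<Word1>', '<Word4>', '<Word3>']] * 200

# window=1 so only the immediate neighbours form the (target, proximity) pairs,
# sg=1 selects the skip-gram architecture.
model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=1, epochs=50, seed=1)

print(model.wv.most_similar('<Word2>'))  # <Word4> should rank high: it shares <Word2>'s contexts
print(model.wv.most_similar('<Word3>'))  # neighbours share <Word3>'s contexts, not necessarily <Word4>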