Say that I'm training a (Gensim) Word2Vec model with min_count=5. The documentation tells us what min_count does:
Ignores all words with total frequency lower than this.
What is the effect of min_count on the context? Let's say that I have a sentence of frequent words (frequency ≥ 5) and infrequent words (frequency < 5), annotated with f and i:
This (f) is (f) a (f) test (i) sentence (i) which (f) is (f) shown (i) here (i)
I just made up which word is frequently used and which word is not for demonstration purposes.
If I remove all infrequent words, we get a completely different context from which Word2Vec is trained. In this example, the sentence would become "This is a which is", which would then be a training sentence for Word2Vec. Moreover, if there are a lot of infrequent words, words that were originally very far away from each other are now placed within the same context.
Is this the correct interpretation of Word2Vec? Are we simply assuming that you shouldn't have too many infrequent words in your dataset (or that you should set a lower min_count threshold)?
Words below the min_count frequency are dropped before training occurs, so the relevant context window is the word-distance among surviving words.
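As a minimal sketch (assuming Gensim 4.x, where the surviving vocabulary is exposed via model.wv.key_to_index), you can see that words below min_count never enter the vocabulary at all, so the surviving words become each other's effective neighbours:

```python
from gensim.models import Word2Vec

# Toy corpus: 'test', 'sentence', 'shown', 'here' appear only once each,
# while the other tokens are repeated often enough to clear min_count=5.
sentences = [
    ["this", "is", "a", "test", "sentence", "which", "is", "shown", "here"],
] + [["this", "is", "a", "which", "is"]] * 10

model = Word2Vec(sentences, vector_size=50, window=2, min_count=5, epochs=5)

# Rare words are absent from the vocabulary; expected: ['a', 'is', 'this', 'which'].
# In the first sentence, 'a' and 'which' therefore fall inside the same window.
print(sorted(model.wv.key_to_index))
```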
This de facto shrinking of contexts is usually a good thing: the infrequent words don't have enough varied examples to obtain good vectors for themselves. Further, while each infrequent word is individually rare, in total there are lots of them, so these doomed-to-poor-vector rare words intrude on most other words' training, serving as a sort of noise that makes those word-vectors worse too.
(Similarly, when using the sample parameter to down-sample frequent words, the frequent words are randomly dropped – which also serves to essentially "shrink" the distances between surviving words, and often improves overall vector quality.)
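As a rough illustration (again assuming Gensim 4.x parameter names), both knobs can be set together; sample controls how aggressively very frequent words are probabilistically skipped during training:

```python
from gensim.models import Word2Vec

# Heavily repeated tokens, so every surviving word is very frequent.
sentences = [["this", "is", "a", "which", "is"]] * 50

model = Word2Vec(
    sentences,
    vector_size=50,
    window=2,
    min_count=5,     # drop words seen fewer than 5 times before training
    sample=1e-3,     # probabilistically skip very frequent words during training
    epochs=5,
)
```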