Tags: python, math, gensim, word-embedding, subsampling

Gensim word2vec downsampling sample=0


Does sample=0 in Gensim word2vec mean that no downsampling is applied during training? The documentation says only that

"useful range is (0, 1e-5)"

However, setting the threshold to 0 would cause P(w_i) to equal 1, meaning that no word would be discarded. Am I understanding this correctly?

I'm working on a relatively small dataset of 7597 Facebook posts (18945 words), and my embeddings perform far better with sample=0 than with anything else in the recommended range. Is there a particular reason? The size of the text?
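
To make the arithmetic concrete, here is a rough sketch of the per-word keep-probability, modeled on the formula gensim uses while building the vocabulary (an approximation of gensim's behavior, not its actual code; the token count is taken from my corpus):

```python
from math import sqrt

def keep_probability(count, total_count, sample):
    # Sketch of the keep-probability gensim assigns to each word while
    # building the vocabulary (approximation, not gensim's actual code).
    # gensim special-cases sample=0: the threshold becomes the whole
    # corpus count, pushing every word's keep-probability to 1.0.
    threshold = sample * total_count if sample else total_count
    p = (sqrt(count / threshold) + 1) * (threshold / count)
    return min(p, 1.0)

TOTAL = 18945  # total tokens in my corpus

for count in (1000, 5):          # a frequent word vs. a rare one
    for s in (0, 1e-5, 1e-3):
        print(f"count={count:4d}  sample={s:g}  "
              f"P(keep)={keep_probability(count, TOTAL, s):.3f}")
```

With sample=0 every word's keep-probability clips to 1.0, while at sample=1e-5 even a word occurring only 5 times is discarded roughly 77% of the time on a corpus this size.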


Solution

  • That seems an incredibly tiny dataset for Word2Vec training. (Is that 18945 unique words, or 18945 words total? If the latter, that's hardly more than 2 words per post.)

    Sampling is most useful on larger datasets, where there are so many examples of common words that additional training examples of them add little, but they steal training time from, and over-weight those words' examples compared to, other less-frequent words.

    Yes, sample=0 means no down-sampling.
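
    For example, a minimal training sketch with down-sampling disabled, assuming gensim >= 4.0 (the tiny stand-in corpus and parameter values are illustrative only):

    ```python
    from gensim.models import Word2Vec

    # Tiny stand-in corpus: in practice, use your own tokenized posts.
    sentences = [
        ["no", "downsampling", "on", "this", "corpus"],
        ["small", "corpora", "keep", "every", "word"],
    ] * 50  # repeated so min_count is satisfied

    model = Word2Vec(
        sentences,
        vector_size=100,  # called `size` in gensim 3.x
        window=5,
        min_count=2,
        sample=0,         # disable frequent-word down-sampling entirely
        epochs=50,        # tiny corpora often benefit from more passes
        seed=42,
    )
    print(model.wv.most_similar("corpus", topn=3))
    ```

    On a dataset this small, experimenting with more epochs and a smaller vector_size may matter more than the sample setting.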