
How to set a maximum vocabulary frequency in doc2vec


While creating its vocabulary, Doc2Vec lets you set the minimum number of times a word must occur in the documents to be included, via the min_count parameter.

import gensim

model = gensim.models.doc2vec.Doc2Vec(vector_size=200, min_count=3, epochs=100, workers=8)

Is there a corresponding parameter to exclude words which appear far too often?

I know one way is to do this in a preprocessing step, by counting each word and manually deleting the overly frequent ones, but it would be nice to know whether there is a built-in method, as that leaves more room for experimentation. Many thanks for the answer.


Solution

  • There's no explicit max_count parameter in gensim's Word2Vec.

    If you're sure some tokens are meaningless, you should preprocess your text to eliminate them.
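
    For example, a minimal preprocessing sketch using only the standard library (MAX_COUNT and tokenized_docs are hypothetical names of your own, and the ceiling is something you'd tune for your corpus):

        from collections import Counter

        MAX_COUNT = 10000  # hypothetical ceiling; tune for your corpus

        # tokenized_docs: your corpus as a list of token lists
        counts = Counter(word for doc in tokenized_docs for word in doc)
        filtered_docs = [[word for word in doc if counts[word] <= MAX_COUNT]
                         for doc in tokenized_docs]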

    There is also a trim_rule option that can be passed at model instantiation or to build_vocab(), where your own function can discard some words; see the gensim docs at:

    https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
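
    For instance, a minimal trim_rule sketch that discards any word above a chosen ceiling (MAX_COUNT is a hypothetical threshold of your own, not a gensim parameter):

        from gensim.models.doc2vec import Doc2Vec
        from gensim.utils import RULE_DISCARD, RULE_DEFAULT

        MAX_COUNT = 10000  # hypothetical ceiling; tune for your corpus

        def trim_rule(word, count, min_count):
            # Discard words above the ceiling; otherwise defer to
            # gensim's usual min_count handling.
            if count > MAX_COUNT:
                return RULE_DISCARD
            return RULE_DEFAULT

        model = Doc2Vec(vector_size=200, min_count=3, epochs=100, workers=8,
                        trim_rule=trim_rule)

    Note that the rule is only applied while the vocabulary is being built; it is not stored as part of the model.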

    Similarly, you could potentially avoid calling build_vocab() directly, and instead call its substeps yourself – editing the discovered raw-counts dictionary before the vocabulary is finalized. You would want to consult the source code to do this, and could use the code that discards too-infrequent words as a model for your own additional code.
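
    As a rough sketch of that approach (based on the gensim 4.x source, with corpus standing for your iterable of TaggedDocument objects; scan_vocab, raw_vocab, prepare_vocab, and prepare_weights are internal names that may differ in your version, so verify against the code you're running):

        from gensim.models.doc2vec import Doc2Vec

        MAX_COUNT = 10000  # hypothetical ceiling; tune for your corpus

        model = Doc2Vec(vector_size=200, min_count=3, epochs=100, workers=8)

        # First build_vocab() substep: survey the corpus for raw counts.
        total_words, corpus_count = model.scan_vocab(corpus_iterable=corpus)
        model.corpus_count = corpus_count
        model.corpus_total_words = total_words

        # Edit the raw-counts dictionary before the vocabulary is finalized.
        model.raw_vocab = {word: count for word, count in model.raw_vocab.items()
                           if count <= MAX_COUNT}

        # Remaining substeps: apply min_count/sample, then allocate weights.
        model.prepare_vocab()
        model.prepare_weights()

        model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)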

    The classic sample parameter of Word2Vec also controls the downsampling of high-frequency words, to prevent the model from spending too much relative effort on redundantly training abundant words. The more aggressive (smaller) this value is, the more instances of high-frequency words will be randomly skipped during training. The default of 1e-03 (0.001) is very conservative; in very large natural-language corpora I've seen good results with values as aggressive as 1e-07 (0.0000001) or 1e-08 (0.00000001) – so in another domain where some lower-meaning tokens are very frequent, similarly aggressive downsampling is worth trying.
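
    For example (sample=1e-05 here is just an illustrative middle ground, not a recommendation):

        from gensim.models.doc2vec import Doc2Vec

        # Downsample very frequent words more aggressively than the
        # default sample=1e-03; tune against your own evaluations.
        model = Doc2Vec(vector_size=200, min_count=3, sample=1e-05,
                        epochs=100, workers=8)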

    The newer ns_exponent option changes negative sampling to adjust the relative favoring of less-frequent words. The original word2vec work used a fixed value of 0.75, but some research since has suggested other domains, like recommendation systems, might benefit from other values that are more or less sensitive to actual token frequencies. (The relevant paper is linked in the gensim docs for the ns_exponent parameter.)
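
    As a minimal sketch (ns_exponent=0.5 is an arbitrary illustrative value; 0.75 matches the original word2vec behavior):

        from gensim.models.doc2vec import Doc2Vec

        # A lower exponent flattens the negative-sampling distribution,
        # reducing how strongly it favors the most frequent words.
        model = Doc2Vec(vector_size=200, min_count=3, negative=5,
                        ns_exponent=0.5, epochs=100, workers=8)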