Tags: python, nlp, gensim, phrase

Gensim Phrases model vocabulary length does not correspond to the number of iteratively added documents


I iteratively apply the...

bigram.add_vocab(<List of List with Tokens>)

method in order to update a...

bigram = gensim.models.phrases.Phrases(min_count=bigramMinFreq, threshold=10.0)

Gensim Phrases model. With each iteration, up to ~10,000 documents are added. My intuition is therefore that the Phrases model grows with each added document set. I test this intuition by checking the length of the bigram vocabulary with...

len(bigram.vocab)

Furthermore, I also check the number of phrasegrams in the frozen Phrases model with...

bigram_freezed = bigram.freeze()
len(bigram_freezed.phrasegrams)
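
For context, the update-and-check step boils down to the following simplified sketch (load_tokenized_docs stands in for my actual loading code, and the directory names and bigramMinFreq value are just placeholders):

import gensim

bigramMinFreq = 5  # placeholder; the real value is set earlier in the notebook

bigram = gensim.models.phrases.Phrases(min_count=bigramMinFreq, threshold=10.0)

for directory in ["000", "001", "002", "003", "004", "005"]:
    docs = load_tokenized_docs(directory)  # stand-in: returns a list of token lists
    bigram.add_vocab(docs)

    bigram_freezed = bigram.freeze()
    print("Data of directory: ", directory)
    print("Num of Docs:", len(docs))
    print("Updated Bigram Vocab is: ", len(bigram.vocab))
    print("Amount of phrasegrams in freezed bigram model: ", len(bigram_freezed.phrasegrams))
    print("-" * 55)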

A resulting output looks as follows:

Data of directory:  000  is loaded
Num of Docs: 97802
Updated Bigram Vocab is:  31819758
Amount of phrasegrams in freezed bigram model:  397554
-------------------------------------------------------
Data of directory:  001  
Num of Docs: 93368
Updated Bigram Vocab is:  17940420
Amount of phrasegrams in freezed bigram model:  429162
-------------------------------------------------------
Data of directory:  002  
Num of Docs: 87265
Updated Bigram Vocab is:  36120292
Amount of phrasegrams in freezed bigram model:  661023
-------------------------------------------------------
Data of directory:  003
Num of Docs: 55852
Updated Bigram Vocab is:  20330876
Amount of phrasegrams in freezed bigram model:  604504
-------------------------------------------------------
Data of directory:  004
Num of Docs: 49390
Updated Bigram Vocab is:  31101880
Amount of phrasegrams in freezed bigram model:  745827
-------------------------------------------------------
Data of directory:  005
Num of Docs: 56258
Updated Bigram Vocab is:  19236483
Amount of phrasegrams in freezed bigram model:  675705
-------------------------------------------------------
...

As can be seen, neither the bigram vocab count nor the phrasegram count of the frozen bigram model increases continuously. I expected both counts to grow as documents are added.

Do I misunderstand what phrase.vocab and phraser.phrasegrams refer to? (If needed I can add the whole corresponding Jupyter Notebook cell.)


Solution

  • To avoid using an unbounded amount of RAM, the Gensim Phrases class uses a default parameter max_vocab_size=40000000, per the source code & docs at:

    https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases

    Unfortunately, the mechanism behind this cap is very crude & non-intuitive. Whenever the tally of all known keys in the survey dict (which includes both unigrams & bigrams) hits this threshold (default 40,000,000), a prune operation is performed that discards all token counts (unigrams & bigrams) at low frequencies until the total number of unique keys is under the threshold. And it sets the low-frequency floor for future prunes to be at least as high as was necessary for this prune.

    For example, the 1st time this cap is hit, the prune might need to discard all the 1-count tokens. And due to the typical Zipfian distribution of word frequencies, that step alone might not just get the total count of known tokens slightly under the threshold, but massively under it. Any subsequent prune will then start by eliminating at least everything with fewer than 2 occurrences.

    This results in the sawtooth counts you're seeing. When the model can't fit in max_vocab_size, it overshrinks; it may do this many times in the course of processing a very large corpus. As a result, final counts of lower-frequency words/bigrams can also be serious undercounts – depending somewhat arbitrarily on whether a key's counts survived the various prune thresholds. (That's also influenced by where in the corpus a token appears. A token that only appears in the corpus after the last prune will still have a precise count, even if it only appears once! Meanwhile, rare tokens that appeared any number of times could be severely undercounted, if they were always below the cutoff at each prior prune.) The toy sketch in the last bullet below shows this effect in isolation.

    The best solution would be a precise count backed by some spillover storage on disk, pruning (if at all) only at the very end, so that only the truly-least-frequent keys are discarded. Unfortunately, Gensim's never implemented that option.

    The next-best, for many cases, could be to use a memory-efficient approximate counting algorithm that roughly maintains the right magnitudes of counts for a much larger number of keys. There's been a little work in Gensim on this in the past, but it's not yet integrated with the Phrases functionality.

    That leaves you with the only practical workaround in the short term: change the max_vocab_size parameter to be larger.

    You could try setting it to math.inf (might risk lower performance due to int-vs-float comparisons) or sys.maxsize – essentially turning off the pruning entirely, to see if your survey can complete without exhausting your RAM. But, you might run out of memory anyway.

    You could also try a larger-but-not-essentially-infinite cap – whatever fits in your RAM – so that far less pruning is done. But you'll still see the non-intuitive decreases in total counts, sometimes, if in fact the threshold is ever enforced. Per the docs, a very rough (perhaps outdated) estimate is that the default max_vocab_size=40000000 consumes about 3.6GB at peak saturation. So if you've got a 64GB machine, you could possibly try a max_vocab_size thats 10-14x larger than the default, etc.