
Why is Doc2Vec.scale_vocab(...)['memory']['vocab'] divided by 700 to obtain vocabulary size?


From the Doc2Vec Wikipedia tutorial at https://github.com/RaRe-Technologies/gensim/blob/master/docs/notebooks/doc2vec-wikipedia.ipynb:

# `pre` is the Doc2Vec model from the notebook, after its vocabulary has been scanned
for num in range(0, 20):
    print('min_count: {}, size of vocab: '.format(num),
          pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab'] / 700)

Output is:

min_count: 0, size of vocab: 8545782.0
min_count: 1, size of vocab: 8545782.0
min_count: 2, size of vocab: 4227783.0
min_count: 3, size of vocab: 3008772.0
min_count: 4, size of vocab: 2439367.0
min_count: 5, size of vocab: 2090709.0
min_count: 6, size of vocab: 1856609.0
min_count: 7, size of vocab: 1681670.0
min_count: 8, size of vocab: 1546914.0
min_count: 9, size of vocab: 1437367.0
min_count: 10, size of vocab: 1346177.0
min_count: 11, size of vocab: 1267916.0
min_count: 12, size of vocab: 1201186.0
min_count: 13, size of vocab: 1142377.0
min_count: 14, size of vocab: 1090673.0
min_count: 15, size of vocab: 1043973.0
min_count: 16, size of vocab: 1002395.0
min_count: 17, size of vocab: 964684.0
min_count: 18, size of vocab: 930382.0
min_count: 19, size of vocab: 898725.0

In the original paper, the vocabulary size was set to 915,715. Setting min_count = 19 yields a similar vocabulary size (898,725).

Dividing by 700 seems rather arbitrary, and I don't see any mention of it in the docs.


Solution

  • It doesn't make sense, but here's the reason:

    scale_vocab() (via an internal estimate_memory() function) returns a dict with a bunch of rough estimates of how much memory, in bytes, the model will need for the given min_count. Those estimates are based on the idea that each word in the model's vocab dict will take about 700 bytes in a hierarchical-softmax (HS) model (where it includes some extra Huffman-coding information), or just 500 bytes in a negative-sampling model. See:

    https://github.com/RaRe-Technologies/gensim/blob/5f630816f8cde46c8408244fb9d3bdf7359ae4c2/gensim/models/word2vec.py#L1343

    (These are very rough estimates based on a series of ad-hoc tests I ran, and might vary a lot in other environments – but the vocab usually isn't the biggest factor in a model's memory use, so precision here isn't that important.)
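
    For illustration, here is a minimal sketch of that heuristic (the helper name is hypothetical; only the ~700/~500 bytes-per-word constants come from gensim's estimate code):

        def rough_vocab_memory_bytes(vocab_size, hs=True):
            # ~700 bytes per retained word with hierarchical softmax (extra
            # Huffman-coding info per word), ~500 bytes with negative sampling.
            bytes_per_word = 700 if hs else 500
            return vocab_size * bytes_per_word

        # The notebook back-calculates in the other direction:
        # memory_estimate / 700 ~= retained vocab size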

    It appears the notebook is attempting to back-calculate the exact retained vocab size from that memory estimate, given the dry_run=True trial numbers.

    But it really doesn't have to do that. The same dict of results from scale_vocab() that includes the memory estimates also includes, under a top-level retain_total key, the exact calculated retained vocab size. See:

    https://github.com/RaRe-Technologies/gensim/blob/5f630816f8cde46c8408244fb9d3bdf7359ae4c2/gensim/models/word2vec.py#L723
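
    For example, assuming the same pre model as in the question, the loop could read retain_total directly (a sketch, not verified against that exact notebook):

        for num in range(0, 20):
            report = pre.scale_vocab(min_count=num, dry_run=True)
            print('min_count: {}, size of vocab: {}'.format(num, report['retain_total']))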

    So, the notebook could be improved.