python · nlp · gensim · word2vec · word-embedding

gensim word2vec vocabulary size fluctuates up & down as corpus grows despite `max_vocab_size` setting


I am training word embeddings with gensim's Word2Vec model on a multi-million-sentence corpus containing roughly 3 million unique tokens, with max_vocab_size = 32_000.

Even though I set min_count = 1, the model builds a vocabulary far smaller than 32_000. When I use a subset of the corpus, the vocabulary size increases!

To troubleshoot, I set up an experiment that measures the vocabulary size for differently sized sub-corpora. The vocabulary size fluctuates!

You can reproduce it with the code below:

import string
import numpy as np
from gensim.models import Word2Vec

letters = list(string.ascii_lowercase)

# creating toy sentences
sentences = []
number_of_sentences = 100_000

for _ in range(number_of_sentences):
    # each toy sentence gets 1-14 random tokens, each 1-4 random letters long
    number_of_tokens = np.random.randint(1, 15)
    sentence = []
    for _ in range(number_of_tokens):
        len_of_token = np.random.randint(1, 5)
        token = "".join(np.random.choice(letters, len_of_token))
        sentence.append(token)
    sentences.append(sentence)

# Sanity check to ensure that the input data is a list of lists of strings (tokens)
for _ in range(4):
    print(sentences[np.random.randint(len(sentences))])

# collecting some statistics about tokens
flattened = [token for sentence in sentences for token in sentence]
unique_tokens = set(flattened)

print('Number of tokens:', f'{len(flattened):,}')
print('Number of unique tokens:', f'{len(unique_tokens):,}')


# gensim model
vocab_size = 32_000
min_count = 1
collected_data = []
for num_sentence in range(5_000, number_of_sentences + 5_000, 5_000):
    # build only the vocabulary (no training) for a growing slice of the corpus
    model = Word2Vec(min_count=min_count, max_vocab_size=vocab_size)
    model.build_vocab(sentences[:num_sentence])
    collected_data.append((num_sentence, len(model.wv.key_to_index)))

for duo in collected_data:
    print('Vocab size of', duo[1], 'for', duo[0], 'number of sentences!')

Output:

['cpi', 'bog', 'df', 'tgi', 'xck', 'kkh', 'ktw', 'ay']
['z', 'h', 'w', 'jek', 'w', 'dqm', 'wfb', 'agq', 'egrg']
['kgwb', 'lahf', 'kzx', 'd', 'qdok', 'xka', 'hbiz', 'bjo', 'fvk', 'j', 'hx']
['old', 'c', 'ik', 'n', 'e', 'n', 'o', 'r', 'ehx', 'dlud', 'd']

Number of tokens: 748,383
Number of unique tokens: 171,485

Vocab size of 16929 for 5000 number of sentences!
Vocab size of 30314 for 10000 number of sentences!
Vocab size of 19017 for 15000 number of sentences!
Vocab size of 31394 for 20000 number of sentences!
Vocab size of 19564 for 25000 number of sentences!
Vocab size of 31831 for 30000 number of sentences!
Vocab size of 19543 for 35000 number of sentences!
Vocab size of 31744 for 40000 number of sentences!
Vocab size of 19536 for 45000 number of sentences!
Vocab size of 31642 for 50000 number of sentences!
Vocab size of 18806 for 55000 number of sentences!
Vocab size of 31255 for 60000 number of sentences!
Vocab size of 18497 for 65000 number of sentences!
Vocab size of 31166 for 70000 number of sentences!
Vocab size of 18142 for 75000 number of sentences!
Vocab size of 30886 for 80000 number of sentences!
Vocab size of 17693 for 85000 number of sentences!
Vocab size of 30390 for 90000 number of sentences!
Vocab size of 17007 for 95000 number of sentences!
Vocab size of 30196 for 100000 number of sentences!

I tried increasing min_count, but it did not stop this fluctuation of the vocabulary size. What am I missing?


Solution

  • In Gensim, the max_vocab_size parameter is a very crude mechanism to limit RAM usage during the initial scan of the training corpus to discover the vocabulary. You should only use this parameter if it's the only way to work around RAM problems.

    Essentially: try without using max_vocab_size. If you want control over which words are retained, use alternate parameters like min_count (to discard words less-frequent than a certain threshold) or max_final_vocab (to take no more than a set number of the most-frequent words).
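
    For example, on the toy corpus above, a sketch along these lines (assuming a gensim version with the max_final_vocab parameter, which the question's use of wv.key_to_index already implies) keeps the retained vocabulary predictable instead of dependent on when an interim trim last fired:

    from gensim.models import Word2Vec

    # Cap the *final* vocabulary: gensim calibrates an effective min_count so
    # that at most this many of the most-frequent words survive.
    model = Word2Vec(min_count=1, max_final_vocab=32_000)
    model.build_vocab(sentences)
    print('Retained vocab size:', len(model.wv.key_to_index))

    # Or discard rare words directly with a plain frequency threshold.
    model = Word2Vec(min_count=5)
    model.build_vocab(sentences)
    print('Retained vocab size:', len(model.wv.key_to_index))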

    If and only if you hit out-of-memory errors (or massive virtual-memory swapping), then consider using max_vocab_size.

    But even then, because of the way it works, you still wouldn't want to set max_vocab_size to the actual final size you want. Instead, you should set it to some value much, much larger - but just small enough to not exhaust your RAM.
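
    A sketch of that combination (the 10_000_000 interim cap and the corpus_iterable name are purely illustrative; pick whatever cap your RAM can tolerate):

    # Keep the interim cap far above the desired final size so the survey's
    # counts stay near-exact; leave the real trimming to max_final_vocab /
    # min_count. corpus_iterable stands in for your real multi-million-sentence corpus.
    model = Word2Vec(max_vocab_size=10_000_000, max_final_vocab=32_000, min_count=5)
    model.build_vocab(corpus_iterable)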

    This allows the most accurate possible word-counts before other parameters (like min_count & max_final_vocab) are applied.

    If you instead use a low max_vocab_size, the running survey will prematurely trim the counts any time the number of known words reaches that value. That is, as soon as the interim count reaches that many entries, say max_vocab_size=32000, many of the least-frequent counts are forgotten to cap memory usage (and more each time the threshold is reached).

    That makes all final counts approximate (based on how often a term missed the cutoff), and means the final number of unique tokens in the full survey will be some value even less than max_vocab_size, somewhat arbitrarily based on how recently a forgetting-trim was triggered. (Hence, the somewhat random, but always lower than max_vocab_size, counts seen in your experiment output.)
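
    A deliberately simplified sketch of that forgetting behaviour (illustrative only, not gensim's actual implementation) shows why the surviving count lands somewhere below the cap and depends on how recently the last trim fired:

    from collections import Counter

    def capped_survey(sentences, max_vocab_size):
        counts = Counter()
        min_reduce = 1                           # prune threshold, raised after every trim
        for sentence in sentences:
            counts.update(sentence)
            if len(counts) > max_vocab_size:
                # forget every word whose interim count is below the current threshold
                counts = Counter({w: c for w, c in counts.items() if c >= min_reduce})
                min_reduce += 1
        return counts

    # On the toy corpus this typically ends well under 32_000, and the exact
    # figure shifts with corpus size, mirroring the question's experiment.
    print('Surviving entries:', len(capped_survey(sentences, 32_000)))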

    So: max_vocab_size is unlikely to do what most people want, or in a predictable way. Still, it can help a fuzzy survey complete for extreme corpora where unique terms would otherwise overflow RAM.

    Separately: min_count=1 is usually a bad idea in word2vec, as words that lack sufficient varied usage examples won't get good word-vectors themselves. Worse, leaving all such poorly-represented words in the training data tends to serve as noise that dilutes (& delays) what can be learned about the adequately-frequent words.
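
    A quick way to pick a better threshold (sketch, reusing the toy sentences from the question; the same probe works on a real corpus) is to compare how many words each candidate min_count would retain before committing to a full training run:

    for mc in (1, 2, 5, 10):
        probe = Word2Vec(min_count=mc)
        probe.build_vocab(sentences)
        print(f'min_count={mc}: {len(probe.wv.key_to_index):,} words retained')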