python · nlp · gensim · word2vec · word-embedding

[Word2Vec][gensim] Handling missing words in vocabulary with the parameter min_count


Similar questions have been asked on this topic, but I'm not really satisfied with the answers so far; please excuse me for that.

I'm using the Word2Vec class from the Python library gensim.

My problem is that I can't query my model for every word of my corpus once I set the parameter min_count greater than one. Some would say that's logical, since I chose to ignore the words appearing only once. But the function behaves strangely: it raises an error saying word 'blabla' is not in the vocabulary, whereas that is exactly what I asked for (I want this word to be out of the vocabulary).

In case that's not very clear, here is a reproducible example:

import gensim
from gensim.models import Word2Vec

# My corpus
corpus=[["paris","not","great","city"],
       ["praha","better","great","than","paris"],
       ["praha","not","country"]]

# Load a pre-trained model - the original one based on Google News
model_google = gensim.models.KeyedVectors.load_word2vec_format(r'GoogleNews-vectors-negative300.bin', binary=True)

# Initializing our model and upgrading it with Google's 
my_model = Word2Vec(size=300, min_count=2)  # with min_count=1, everything works fine
my_model.build_vocab(corpus)
total_examples = my_model.corpus_count
my_model.build_vocab([list(model_google.vocab.keys())], update=True)
my_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)
my_model.train(corpus, total_examples=total_examples, epochs=my_model.iter)

# Show examples
print(my_model['paris'][0:10])    # works because 'paris' appears twice
print(my_model['country'][0:10])  # fails because 'country' appears only once

You can find Google's model there, for example, but feel free to use any other model, or none at all; that is not the point of my post.

As noted in the code comments: querying the model for 'paris' works, but 'country' does not. And of course, if I set the parameter min_count to 1, everything works fine.

I hope it is clear enough.

Thanks.


Solution

  • It is supposed to throw an error if you ask for a word that's not present, because you chose not to learn vectors for rare words, like 'country' in your example. (Also: words with few usage examples usually don't get good vectors, and retaining them can worsen the vectors for the remaining words, so a min_count as large as you can manage, and perhaps much larger than 1, is usually a good idea.)

    The fix is to do one of the following:

    1. Don't ask for words that aren't present. Check first, via something like Python's in operator. For example:

       if 'country' in my_model:
           print(my_model['country'][0:10])
       else:
           pass  # do nothing, since `min_count=2` means there's no 'country' vector
    
    2. Catch the error, falling back to whatever you want to happen for absent words (gensim raises a KeyError for unknown words, so catch that rather than using a bare except):

       try:
           print(my_model['country'][0:10])
       except KeyError:
           pass  # do nothing, or perhaps log a warning - whatever fits
    
    3. Change to using a model that always returns something for any word, like FastText – which will try to synthesize a vector for unknown words, using subwords learned during training. (It might be garbage, or it might be pretty good if the unknown word is highly similar to known words in characters & meaning, but for some uses it's better than nothing.)
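To make option 3 concrete, here is a minimal, self-contained sketch of the *idea* behind FastText's out-of-vocabulary handling – not gensim's actual implementation. The n-gram vectors below are made-up 4-dimensional toy values; in a real model they would be learned during training and the synthesized vector is (roughly) the average of the vectors of the word's character n-grams:

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Hypothetical n-gram vectors, standing in for what training would learn.
ngram_vectors = {
    "<co": [0.1, 0.2, 0.0, 0.3],
    "cou": [0.0, 0.1, 0.4, 0.1],
    "oun": [0.2, 0.0, 0.1, 0.0],
    "unt": [0.1, 0.3, 0.2, 0.2],
    "ntr": [0.0, 0.2, 0.1, 0.4],
    "try": [0.3, 0.1, 0.0, 0.1],
    "ry>": [0.1, 0.0, 0.2, 0.0],
}

def oov_vector(word, dim=4):
    """Average the vectors of the word's known n-grams; zeros if none match."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vals) / len(known) for vals in zip(*known)]

print(char_ngrams("country"))
# ['<co', 'cou', 'oun', 'unt', 'ntr', 'try', 'ry>']
print(oov_vector("country"))
```

Even though 'country' itself was dropped by min_count, its character n-grams overlap with retained words, so a FastText-style model can still produce a plausible vector for it – which is exactly the behavior the real gensim FastText class gives you for free.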