Gensim: KeyError: "word not in vocabulary"

I have trained a Word2Vec model using Python's Gensim library. I have a tokenized list as shown below. The vocabulary size is 34, but I am showing only a few of the 34 words:

b = ['let',


model = gensim.models.Word2Vec(b,min_count=1,size=32)
# prints: Word2Vec(vocab=34, size=32, alpha=0.025)

If I try to get the similarity score for one of the words in the list by doing model['buy'], I get:

KeyError: "word 'buy' not in vocabulary"

Can you suggest what I am doing wrong, and how I can check the model so that it can then be used with PCA or t-SNE to visualize similar words forming a topic? Thank you.


  • The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Each sentence is itself a list of words. From the docs:

    Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

    Right now, it treats each word in your list b as a sentence, so it trains Word2Vec on each character of each word rather than on each word in b. As a result, only single characters are in the vocabulary, and looking one of them up returns a vector:

    model = gensim.models.Word2Vec(b,min_count=1,size=32)
    array([  7.42487283e-03,  -5.65282721e-03,   1.28707094e-02, ... ]

    To get it to work for words, simply wrap b in another list so that it is interpreted as a single sentence of words. A lookup such as model['buy'] then returns a vector:

    model = gensim.models.Word2Vec([b],min_count=1,size=32)
    array([-0.01331611,  0.00496594, -0.00165093, -0.01444992,  0.01393849, ... ]
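
To make the difference between b and [b] concrete, here is a minimal sketch in plain Python that only mimics how a sentence-iterating trainer walks its input (outer items are sentences, inner items are tokens). It is not Gensim's actual implementation; build_vocab is a hypothetical helper, and the two-word list is a stand-in for the truncated list in the question.

```python
from collections import Counter


def build_vocab(sentences):
    """Count tokens as a sentence-iterating trainer would see them.

    Each item of `sentences` is treated as one sentence, and each item
    of a sentence is treated as one token.
    """
    vocab = Counter()
    for sentence in sentences:
        for token in sentence:
            vocab[token] += 1
    return vocab


# Hypothetical stand-in for the (truncated) list in the question.
b = ['let', 'buy']

# Passing b directly: every string is a "sentence", so tokens are characters.
char_vocab = build_vocab(b)

# Passing [b]: one sentence whose tokens are the whole words.
word_vocab = build_vocab([b])

print(sorted(char_vocab))   # ['b', 'e', 'l', 't', 'u', 'y']
print(sorted(word_vocab))   # ['buy', 'let']
print('buy' in char_vocab)  # False
print('buy' in word_vocab)  # True
```

The same membership check is a quick way to inspect a trained model before feeding its vectors to PCA or t-SNE. Note that the calls in this thread use the pre-4.0 Gensim API; in Gensim 4.0 and later the size parameter is named vector_size, vectors are looked up with model.wv['buy'] rather than model['buy'], and the vocabulary can be checked with 'buy' in model.wv.key_to_index.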