word2vec - KeyError: "word X not in vocabulary"

I'm using the Word2Vec implementation in the gensim module to build word embeddings for the sentences I have in a plain text file. Although the word happy is in the vocabulary, I get the error KeyError: "word 'happy' not in vocabulary". I tried the answers given to a similar question, but they did not work, so I'm posting my own question.

Here is the code:

    from gensim.models import Word2Vec

    try:
        data = []
        with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
            for line in txt_file:
                for part in line.split(' '):
                    data.append(part)

        # When I debug, both of the words 'happy' and 'birthday' exist in the variable 'data'
        word2vec = Word2Vec(data, min_count=5, size=10000, window=5, workers=4)

        # Print result
        word_1 = 'happy'
        word_2 = 'birthday'
        print(f'Similarity between {word_1} and {word_2} thru word2vec: {word2vec.similarity(word_1, word_2)}')
    except Exception as err:
        print(f'An error happened! Detail: {str(err)}')


  • When you get a "not in vocabulary" error like this from Word2Vec, you can trust it: 'happy' really isn't in the model.

    Even if your visual check shows 'happy' inside your file, a few reasons why it might not wind up inside the model include:

    • it doesn't occur at least min_count=5 times

    • the data format isn't correct for Word2Vec, so it's not seeing the words you expect it to see.

    Looking at how data is prepared by your code, it looks like a giant list of all words in your file. Word2Vec instead expects a sequence that has, as each item, a list-of-words for that one text. So: not a list-of-words, but a list where each item is a list-of-words.

    If you've supplied...

      [
        'happy',
        'birthday',
      ]

    ...instead of the expected...

      [
        ['happy', 'birthday',],
      ]

    ...those single-word strings will be seen as lists-of-characters, so Word2Vec will think you want to learn word-vectors for a bunch of one-character words. You can check whether this has affected your model by seeing if the vocabulary size seems small (len(model.wv)) or if a sample of learned words is only single-character words (model.wv.index2entity[:10]).
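
    To see why a flat list goes wrong, here is a minimal plain-Python sketch with no gensim involved. The helper count_tokens is hypothetical and is not gensim code; it only assumes, as described above, that each item of the corpus is iterated as the tokens of one text.

```python
from collections import Counter


def count_tokens(corpus):
    """Count tokens by treating each corpus item as one text.

    Hypothetical helper for illustration only; it is not gensim code.
    """
    counts = Counter()
    for text in corpus:       # each item is treated as one text
        for token in text:    # iterating a str yields its characters
            counts[token] += 1
    return counts


flat = ['happy', 'birthday']        # list of words (wrong shape)
nested = [['happy', 'birthday']]    # list of lists of words (expected shape)

flat_counts = count_tokens(flat)
nested_counts = count_tokens(nested)

print(sorted(flat_counts))    # single characters only
print(sorted(nested_counts))  # whole words
```

    With the flat list, 'happy' never shows up as a token at all, which matches the "not in vocabulary" symptom.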

    If you supply a word in the right format, at least min_count times, as part of the training-data, it will wind up with a vector in the model.
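
    A minimal sketch of preparing the data that way, assuming one text per line and whitespace-separated words. The names read_sentences and surviving_words are hypothetical helpers, and the frequency check only approximates the min_count rule described above; it is not gensim's own code.

```python
from collections import Counter


def read_sentences(path):
    """Return a list where each item is the list of words on one line."""
    sentences = []
    with open(path, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            words = line.split()   # split on any whitespace, drops '\n'
            if words:              # skip blank lines
                sentences.append(words)
    return sentences


def surviving_words(sentences, min_count):
    """Words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}
```

    Passing the result of read_sentences(TXT_PATH) to Word2Vec in place of data should give it the shape it expects, and 'happy' in surviving_words(sentences, 5) is a quick pre-training check that the word is frequent enough.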

    (Separately: size=10000 is a choice way outside the usual range of 100-400. I've never seen a project using such high-dimensionality for word-vectors, and it would only be theoretically justifiable if you had a massively-large vocabulary and training-set. Oversized vectors with smaller vocabularies/data are likely to create uselessly overfit results.)