Tags: python, nlp, gensim, word2vec, keyerror

How to handle KeyError(f"Key '{key}' not present") in word2vec with gensim


I have built a model with the gensim library and am trying to get the vector of a word that is not present in the vocabulary, but I get an error. I want to handle this error in the best way possible. If I could get a vector for a word that is not present in the model, that would be perfect.

The code

from gensim.models import KeyedVectors

model = KeyedVectors.load('nice.model')
token_vector = model.wv['bla bla bla']

Error

  File "/home/ahmed/PycharmProjects/WebScarping/venv/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 421, in get_index
    raise KeyError(f"Key '{key}' not present")
KeyError: "Key 'hmed' not present"

Please help me resolve this error.


Solution

  • If the token is not present in the model, it can't give you a vector for it.

    Your model doesn't have a vector for the (pseudo-)word 'bla bla bla'; all it can do is report that.

    You could avoid the exception by pre-checking whether the token is present, and only requesting it if present:

    if token in model.wv:
        token_vector = model.wv[token]
    else:
        # whatever your next-best step is when a vector is not available
        ...
    

    Or, you could catch the exception:

    try:
        token_vector = model.wv[token]
    except KeyError:
        # whatever your next-best step is when a vector is not available
        ...
    

    But there's no magic way to create a good vector for an unknown token. You'll have to ignore such words, make up some stand-in value, or figure out some other project-appropriate workaround.
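As a rough sketch of the two simplest workarounds (skipping unknown tokens, or substituting a stand-in value), something like the following could work. The helper names `lookup_or_default` and `known_vectors` are hypothetical, and a plain dict stands in for `model.wv` so the sketch runs without a trained model; it relies only on the `in` and `[]` operations used in the snippets above.

```python
def lookup_or_default(vectors, token, default=None):
    """Return the vector for `token`, or `default` if the token is unknown."""
    if token in vectors:
        return vectors[token]
    return default


def known_vectors(vectors, tokens):
    """Keep only (token, vector) pairs for tokens that have a vector."""
    return [(t, vectors[t]) for t in tokens if t in vectors]


# Stand-in for model.wv (assumption: any object supporting `in` and `[]`).
toy_vectors = {"nice": [0.1, 0.2, 0.3], "model": [0.4, 0.5, 0.6]}
zero_vector = [0.0, 0.0, 0.0]  # arbitrary stand-in value, not a "good" vector

tokens = ["nice", "bla bla bla", "model"]

# Option 1: substitute a stand-in vector for unknown tokens.
padded = [lookup_or_default(toy_vectors, t, zero_vector) for t in tokens]

# Option 2: silently drop unknown tokens.
kept = known_vectors(toy_vectors, tokens)
```

With a real model you would pass `model.wv` instead of the dict, and a zero vector of length `model.wv.vector_size` is one possible default. Whether a zero vector, skipping, or something else is appropriate depends on what the vectors feed into downstream.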

    (If you have sufficient training data with varied examples of the token's real usage, you could train a model that includes the token. You could also consider finding or training a word2vec variant model like FastText, which can synthesize guess-vectors for unknown tokens based on which substrings they might share with words learned in training – but such vectors may be quite poor in quality.)