
Gensim- KeyError: 'word not in vocabulary'


I am trying to calculate product similarity, similar to what is done in this example: how-to-build-recommendation-system-word2vec-python/

I have a dictionary where the key is the item_id and the value is the product name associated with it. For example: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)

The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when I call the similarity function on the model with a key that is present in the dictionary.

# train word2vec model
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)

def similar_products(v, n=6):  # similarity function
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn=n + 1)[1:]

    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)

    return new_ms

I am calling the function using:

similar_products(model['100018'])

Note: I was able to run the example code with a very similar input data structure, which was also a dictionary. Can someone tell me what I am missing here?


Solution

  • If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.

    If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.

    You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of word tokens. You should enable logging during training, and watch the output to confirm that the expected amount of word-discovery and training is happening. You can also look at the exact list of words known to the model (in model.wv.key_to_index) to make sure it includes all the words you expect.
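A quick way to check the corpus shape before training: Word2Vec expects an iterable of lists of string tokens, and a common mistake is passing a bare string (which gensim would iterate character by character). This is a minimal sketch with a made-up `purchases_train` standing in for the one in the question:

```python
# Made-up corpus standing in for purchases_train from the question.
purchases_train = [
    ['100018', '100025'],            # ok: a list of string tokens
    ['100040', '100042', '100018'],  # ok
    '100025',                        # WRONG: a bare string, not a list of tokens
]

def bad_sentences(corpus):
    """Return indices of corpus items that are not lists of strings."""
    bad = []
    for i, sent in enumerate(corpus):
        if not isinstance(sent, list) or not all(isinstance(w, str) for w in sent):
            bad.append(i)
    return bad

print(bad_sentences(purchases_train))  # -> [2]
```

Any index this reports points at an item that the model will not tokenize the way you expect.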

    One common gotcha is that, for the best operation of the word2vec algorithm, the Word2Vec class uses a default min_count=5. (Word2vec only works well with multiple varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector, and may even make surrounding words' vectors worse. So the usual best practice is to discard very-rare words.)

    Does the (pseudo-)word '100018' appear in your corpus fewer than 5 times? If so, the model will ignore it as too rare to get a good vector, or to have any positive influence on other word-vectors.
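You can check this directly by counting token frequencies in the corpus before training: anything occurring fewer than min_count times (5 by default) will be silently dropped from the vocabulary. A minimal sketch, with a made-up `purchases_train`:

```python
from collections import Counter

# Made-up corpus standing in for purchases_train from the question.
purchases_train = [
    ['100018', '100025', '100040'],
    ['100025', '100042', '100025'],
    ['100025', '100040', '100025'],
]

min_count = 5  # Word2Vec's default
freq = Counter(w for sent in purchases_train for w in sent)

# These ids occur fewer than min_count times and would be dropped.
too_rare = sorted(w for w, c in freq.items() if c < min_count)
print(too_rare)
```

If ids like these genuinely must stay in the model, you can pass min_count=1 to Word2Vec, with the caveat above: items seen only once or twice will get poor-quality vectors.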

    Separately, the site you're using example code from may not be a quality source. It changes a bunch of default values for no good reason - such as setting alpha and min_alpha to peculiar non-standard values with no comment as to why. That is usually a sign the author copied odd choices from someone else without understanding them.