gensim word2vec

word2vec recommendation system KeyError: "word '21883' not in vocabulary"


The code works fine on a data set containing 500,000+ instances, but whenever I reduce the data set to 5,000, 10,000, or 15,000 instances it throws a KeyError: "word '***' not in vocabulary". It does not happen for every data point, but it does for most of them. The data set is in Excel format [1]. I don't know how to fix this, since I have very little knowledge of the topic and am still learning. Please help me fix this problem!

[1]: https://i.sstatic.net/YCBiQ.png

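    from gensim.models import Word2Vec
    from tqdm import tqdm

    # Build one list of purchased StockCodes per customer (one "sentence" each)
    purchases_train = []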
    for i in tqdm(customers_train):
        temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
        purchases_train.append(temp)

    purchases_val = []
    for i in tqdm(validation_df['CustomerID'].unique()):
        temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
        purchases_val.append(temp)


    model = Word2Vec(window = 10, sg = 1, hs = 0,
                     negative = 10, # for negative sampling
                     alpha=0.03, min_alpha=0.0007,
                     seed = 14)

    model.build_vocab(purchases_train, progress_per=200)

    model.train(purchases_train, total_examples = model.corpus_count, 
                epochs=10, report_delay=1)


    model.save("word2vec_2.model")
    model.init_sims(replace=True)

    # extract all vectors
    X = model[model.wv.vocab]

    X.shape

    products = train_df[["StockCode", "Description"]]

    products.drop_duplicates(inplace=True, subset='StockCode', keep="last")


    # Map each StockCode to a list of its descriptions
    products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

    def similar_products(v, n = 6):
        ms = model.similar_by_vector(v, topn= n+1)[1:]
        new_ms = []
        for j in ms:
            pair = (products_dict[j[0]][0], j[1])
            new_ms.append(pair)

        return new_ms

    # This lookup raises the KeyError when '21883' is not in the vocabulary
    similar_products(model['21883'])

Solution

  • If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking up was not in the training data fed to Word2Vec, or did not appear at least min_count times (the default is min_count=5).

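    So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. That also explains why it only shows up on the reduced data sets: with fewer rows, fewer StockCodes reach the threshold. You should do either or both of: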

    • Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors and instead just drag down the quality of the surrounding words' vectors, so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)

    • If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.
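
As a rough illustration of both points, here is a minimal sketch that uses only the standard library. It counts how often each token occurs, which is the quantity the min_count cutoff is applied to, and wraps a lookup so that a missing token returns a default instead of raising. The helper names (tokens_meeting_min_count, safe_lookup) and the toy purchase lists are made up for this example, and the plain dict is only a stand-in for model.wv, which likewise raises KeyError for a key it does not contain.

```python
from collections import Counter


def tokens_meeting_min_count(sequences, min_count=5):
    """Return the set of tokens that occur at least min_count times in total."""
    counts = Counter(token for seq in sequences for token in seq)
    return {token for token, n in counts.items() if n >= min_count}


def safe_lookup(vectors, token, default=None):
    """Return vectors[token], or default if the token is absent."""
    try:
        return vectors[token]
    except KeyError:
        return default


# Toy stand-in for purchases_train: one list of StockCodes per customer.
purchases = [
    ["21883", "84029E", "22752"],
    ["84029E", "22752"],
    ["84029E", "22752", "84029E"],
    ["22752", "84029E"],
    ["22752", "84029E"],
]

# '21883' occurs only once, so it falls below the cutoff of 5.
kept = tokens_meeting_min_count(purchases, min_count=5)

# Hypothetical stand-in for model.wv: a dict keyed by the surviving tokens.
toy_vectors = {token: [0.0, 0.0] for token in kept}

missing = safe_lookup(toy_vectors, "21883")   # absent, so the default is returned
present = safe_lookup(toy_vectors, "22752")   # present, so its vector is returned
```

Running a count like this on the reduced purchases_train before training should show which StockCodes will be absent from the vocabulary, and therefore which lookups need either a lower min_count or a guard.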