I have implemented the word2vec model on transaction data (link) of a single category.
My goal is to find substitutable items from the data.
The model is giving results, but I want to make sure that it is producing them based on customers' historical data (i.e., purchase context) and not just on content (semantic similarity of product names). The idea is similar to a recommendation system.
I have implemented this using the gensim library, where I passed the data (products) in form of a list of lists.
Eg.
[['BLUE BELL ICE CREAM GOLD RIM', 'TILLAMK CHOC CHIP CK DOUGH IC'],
 ['TALENTI SICILIAN PISTACHIO GEL', 'TALENTI BLK RASP CHOC CHIP GEL'],
 ['BREYERS HOME MADE VAN ICE CREAM',
  'BREYERS HOME MADE VAN ICE CREAM',
  'BREYERS COFF ICE CREAM']]
Here, each sublist is the past one year of purchase history of a single customer.
# train word2vec model
from gensim.models import Word2Vec

model = Word2Vec(window=5, sg=0,
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
# extract all vectors
import numpy as np

X = []
words = list(model.wv.index_to_key)
for word in words:
    x = model.wv.get_vector(word)
    X.append(x)
Y = np.array(X)
Y.shape
def similar_products(v, n=3):
    # extract most similar products for the input vector
    # (skip the first hit, which is the query item itself)
    ms = model.wv.similar_by_vector(v, topn=n + 1)[1:]
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
similar_products(model.wv['BLUE BELL ICE CREAM GOLD RIM'])
Results:
[('BLUE BELL ICE CREAM BROWN RIM', 0.7322707772254944),
('BLUE BELL ICE CREAM LIGHT', 0.4575043022632599),
('BLUE BELL ICE CREAM NSA', 0.3731085956096649)]
To get an intuitive understanding of word2vec and how its results are obtained, I created a dummy dataset where I wanted to find substitutes of 'FOODCLUB VAN IC PAIL'.
My assumption is that if two products appear in the same basket multiple times, then they are substitutes.
Looking at the data, the first substitute should be 'FOODCLUB CHOC CHIP IC PAIL',
but the results I obtained are:
[('FOODCLUB NEAPOLITAN IC PAIL', 0.042492810636758804),
('FOODCLUB COOKIES CREAM ICE CREAM', -0.04012278839945793),
('FOODCLUB NEW YORK VAN IC PAIL', -0.040678512305021286)]
You may not get a very good intuitive understanding of usual word2vec behavior using these sorts of product-baskets as training data. The algorithm was originally developed for natural-language texts, where tokens' frequencies & co-occurrences follow certain indicative patterns.
People certainly do use word2vec on runs-of-tokens that aren't natural language – like product baskets, or logs-of-actions, etc – but to the extent such tokens follow very different patterns, extra preprocessing or tuning may be necessary, or useful results may be harder to get.
As just one of the ways customer-purchases might differ from real language, depending on what your "pseudo-texts" actually represent: the ordering of items, or their nearness within the window, may or may not be significant, compared to more-distant tokens.
So it's not automatic that word2vec will give interesting results here, for recommendations.
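For example, since items in a basket usually have no meaningful order (unlike words in a sentence), one common tweak – a sketch of an idea, not something from the question's code – is to shuffle each basket before training, and/or to use a window at least as large as the longest basket so every item co-occurs with every other:

```python
import random

# toy baskets standing in for purchases_train (hypothetical data)
purchases_train = [
    ['FOODCLUB VAN IC PAIL', 'FOODCLUB CHOC CHIP IC PAIL'],
    ['FOODCLUB VAN IC PAIL', 'FOODCLUB NEAPOLITAN IC PAIL'],
]

random.seed(14)
for basket in purchases_train:
    random.shuffle(basket)  # baskets are unordered, so position carries no signal

# a window this large makes every pair in a basket count as a co-occurrence
max_window = max(len(b) for b in purchases_train)
```

Whether that helps depends entirely on whether your baskets really are order-free; if the sequence of purchases matters, shuffling destroys that signal.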
That's especially the case with small datasets, or tiny dummy datasets. Word2vec requires lots of varied data to pack elements into interesting relative positions in a high-dimensional space. Even small demos usually have a vocabulary (count of unique tokens) of tens-of-thousands, with training texts that provide varied usage examples of every token dozens of times.
Without that, the model never learns anything interesting/generalizable. That's especially the case if trying to create a many-dimensions model (say the default vector_size=100) with a tiny vocabulary (just dozens of unique tokens) with few usage examples per token. And it only gets worse if tokens appear fewer than the default min_count=5 times – when they're ignored entirely. So don't expect anything interesting to come from your dummy data, at all.
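On data that small, your own stated heuristic – "same basket multiple times" – can be computed directly, and far more reliably than word2vec ever could. A sketch, using illustrative baskets rather than your actual dummy file:

```python
from collections import Counter
from itertools import combinations

# illustrative baskets (hypothetical, standing in for the dummy dataset)
baskets = [
    ['FOODCLUB VAN IC PAIL', 'FOODCLUB CHOC CHIP IC PAIL'],
    ['FOODCLUB VAN IC PAIL', 'FOODCLUB CHOC CHIP IC PAIL'],
    ['FOODCLUB VAN IC PAIL', 'FOODCLUB NEAPOLITAN IC PAIL'],
]

# count how often each unordered pair of products shares a basket
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(set(basket)), 2):
        pair_counts[(a, b)] += 1

# pairs involving the target item, ranked by co-occurrence count
target = 'FOODCLUB VAN IC PAIL'
co = {pair: c for pair, c in pair_counts.items() if target in pair}
best = max(co, key=co.get)
```

With counts this sparse, a table like this is both more interpretable and more trustworthy than cosine similarities from an under-trained embedding.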
If you want to develop an intuition, I'd try some tutorials & other goals with real natural-language text first, with a variety of datasets & parameters, to get a sense of what has what kind of effects on result usefulness – & only after that try to adapt word2vec to other data.
Negative-sampling is the default, & works well with typical datasets, especially as they grow large (where negative-sampling suffers less of a performance hit than hierarchical-softmax with large vocabularies). But a toggle between those two modes is unlikely to cause giant changes in quality unless there are other problems.
Sufficient data, of the right kind, is the key – & then tweaking parameters may nudge end-result usefulness in a better direction, or shift it to be better for certain purposes.
But more specific parameter tips are only possible with clearer goals, once some baseline is working.