
Why Word2Vec function returns me a lot of 0.99 values


I'm trying to apply a word2vec model on a review dataset. First of all I apply the preprocessing to the dataset:

df=df.text.apply(gensim.utils.simple_preprocess)

and this is the dataset that I get:

0       [understand, location, low, score, look, mcdon...
3       [listen, it, morning, tired, maybe, hangry, ma...
6       [super, cool, bathroom, door, open, foot, nugg...
19      [cant, find, better, mcdonalds, know, getting,...
27      [night, went, mcdonalds, best, mcdonalds, expe...
                              ...
1677    [mcdonalds, app, order, arrived, line, drive, ...
1693    [correct, order, filled, promptly, expecting, ...
1694    [wow, fantastic, eatery, high, quality, ive, e...
1704    [let, tell, eat, lot, mcchickens, best, ive, m...
1716    [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object

Now I create the Word2Vec model and train it:

model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df,total_examples=model.corpus_count,epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))

What I don't understand is why most_similar() returns so many similarity values of around 0.99:

[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]

What am I doing wrong?


Solution

  • You're right, that's not normal.

    It's likely that your df isn't in the format Word2Vec expects: a re-iterable Python sequence, where each item is a list of string tokens.

    Try displaying next(iter(df)) to see the first item in df when it's iterated over, as Word2Vec does. Does it look like a good piece of training data?
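    For example, using a plain list of token lists as a stand-in for your df Series, the check would look like this:

    ```python
    # Stand-in for the pandas Series of token lists shown in the question.
    df = [
        ["understand", "location", "low", "score"],
        ["listen", "it", "morning", "tired"],
    ]

    # The first item Word2Vec sees when iterating over the corpus --
    # it should be a list of string tokens, not a single string.
    first_item = next(iter(df))
    print(first_item)
    ```

    If instead you see a single long string, or individual characters, Word2Vec will treat each character as a "word", which produces degenerate results.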

    Separately regarding your code:

    • min_count=1 is always a bad idea with Word2Vec: words that appear only rarely can't get good vectors, but in aggregate they act a lot like random noise, making nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure a lower value helps your results, which you can check by comparing its effect against the default. And if it seems like too much of your vocabulary disappears because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm.
    • Only 283 texts are unlikely to be enough training data unless each text has tens of thousands of tokens. (And even if it were possible to squeeze some results from this far-smaller-than-ideal corpus, you might need to shrink the vector_size and/or increase the epochs to get the most out of minimal data.)
    • If you supply a corpus to sentences in the Word2Vec() construction, you don't need to call .train(). The constructor will have already used that corpus fully. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction time.)
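    As a sketch of the two valid patterns (using a tiny made-up corpus, not your real data):

    ```python
    from gensim.models import Word2Vec

    # A tiny stand-in corpus: a re-iterable sequence of token lists.
    corpus = [
        ["service", "fast", "friendly"],
        ["order", "wrong", "slow", "service"],
    ]

    # Pattern 1: supply the corpus at construction time -- no .train() call needed.
    # (min_count=1 only so this tiny demo has a vocabulary; don't do this on real data.)
    model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=1)

    # Pattern 2: construct without a corpus, then run the steps yourself.
    model2 = Word2Vec(vector_size=50, window=5, min_count=1, workers=1)
    model2.build_vocab(corpus)
    model2.train(corpus, total_examples=model2.corpus_count, epochs=model2.epochs)
    ```

    Use one pattern or the other, not both; your code does both, so the corpus is trained twice.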

    I highly recommend you enable logging to at least the INFO level for the relevant classes (either all of Gensim or just Word2Vec). Then you'll see useful logging/progress info which, if you read it over, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
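    Enabling that logging is a one-liner with the standard library, before you build the model:

    ```python
    import logging

    # Show INFO-level progress from Gensim: vocabulary scan counts,
    # effective training parameters, per-epoch progress, etc.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )
    ```

    With this in place, a second .train() call on an already-trained model shows up clearly in the log output.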