I'm trying to apply a word2vec model to a review dataset. First of all, I apply preprocessing to the dataset:
df=df.text.apply(gensim.utils.simple_preprocess)
and this is the dataset that I get:
0 [understand, location, low, score, look, mcdon...
3 [listen, it, morning, tired, maybe, hangry, ma...
6 [super, cool, bathroom, door, open, foot, nugg...
19 [cant, find, better, mcdonalds, know, getting,...
27 [night, went, mcdonalds, best, mcdonalds, expe...
...
1677 [mcdonalds, app, order, arrived, line, drive, ...
1693 [correct, order, filled, promptly, expecting, ...
1694 [wow, fantastic, eatery, high, quality, ive, e...
1704 [let, tell, eat, lot, mcchickens, best, ive, m...
1716 [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object
Now I create the Word2Vec model and train it:
model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df,total_examples=model.corpus_count,epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))
What I don't understand is why most_similar() returns similarities of about 0.99 for nearly every word.
[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]
What am I doing wrong?
You're right, that's not normal.

It is unlikely that your df is in the proper format Word2Vec expects. It needs a re-iterable Python sequence, where each item is a list of string tokens.

Try displaying next(iter(df)) to see the 1st item in df, as Word2Vec sees it when iterating. Does it look like a good piece of training data?
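For example, a quick check along these lines (a minimal sketch, assuming df is the pandas Series produced by your preprocessing step) should print a plain Python list of string tokens:

first_item = next(iter(df))   # the 1st item Word2Vec sees when it iterates over df
print(type(first_item))       # should be a list (of str tokens), not a str
print(first_item)             # should look like ['understand', 'location', 'low', ...]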
Separately, regarding your code:

- min_count=1 is always a bad idea with Word2Vec: rare words can't get good vectors, but they do, in aggregate, act a lot like random noise that makes nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure that will help your results, which you can only know by comparing that value's effects versus lower values. And if it seems like too much of your vocabulary disappears because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm; in that case, shrink the vector_size and/or increase the epochs to get the most out of minimal data.
- Because you supplied your corpus as sentences in the Word2Vec() construction, you don't need to call .train(). The constructor will have already used that corpus fully for training. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction-time; see the sketch after this list.)
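As a minimal sketch of the simplified setup (assuming df really is a re-iterable sequence of token lists, as checked above), the constructor alone builds the vocabulary and trains:

model = gensim.models.Word2Vec(
    sentences=df,     # corpus passed here, so vocab-building and training happen in the constructor
    vector_size=200,
    window=10,
    min_count=5,      # the default; only lower it if you've verified that helps
    workers=6,
)
# note: no separate model.train() call needed
print(len(model.wv))  # how many words survived the min_count cutoff
print(model.wv.most_similar("service", topn=10))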
I highly recommend you enable logging to at least the INFO level for the relevant classes (either all of Gensim or just Word2Vec). Then you'll see useful logging/progress info which, if you read over it, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
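For instance, one common way to turn on INFO-level logging for everything (including Gensim) before constructing the model is via Python's standard logging module:

import logging
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

With that enabled, the vocabulary-size report and per-epoch progress lines appear in the output, and a redundant second .train() call would show up as a second full set of training messages.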