I ran the following code and am wondering why the top 3 most similar words for "exposure" don't include "charge" and "lend":
from gensim.models import Word2Vec

# Two toy sentences, 9 tokens in total
corpus = [['total', 'exposure', 'charge', 'lend'],
          ['customer', 'paydown', 'rate', 'months', 'month']]
gens_mod = Word2Vec(corpus, min_count=1, vector_size=300, window=2, sg=1, workers=1, seed=1)
keyword = "exposure"
gens_mod.wv.most_similar(keyword)  # ranks every other word by cosine similarity
Output:
[('customer', 0.12233059108257294),
('month', 0.008674687705934048),
('total', -0.011738087050616741),
('rate', -0.03600010275840759),
('months', -0.04291829466819763),
('paydown', -0.044823747128248215),
('lend', -0.05356598272919655),
('charge', -0.07367636263370514)]
The word2vec algorithm is only useful and valuable with large amounts of training data, where every word of interest has a variety of realistic, subtly-contrasting usage examples. A toy-sized dataset won't show its value. It's always a bad idea to set min_count=1. And it's nonsensical to try to train 300-dimensional word-vectors from a corpus of only 9 words total, all 9 of them unique, where most of the words share the exact same neighbors.
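For a sense of scale, you can check how little data the model actually saw. Here's a minimal sketch re-running your code (len(model.wv) and corpus_total_words are standard gensim 4.x):

from gensim.models import Word2Vec

corpus = [['total', 'exposure', 'charge', 'lend'],
          ['customer', 'paydown', 'rate', 'months', 'month']]
model = Word2Vec(corpus, min_count=1, vector_size=300, window=2, sg=1, workers=1, seed=1)

print(len(model.wv))             # 9 unique words in the vocabulary
print(model.corpus_total_words)  # 9 total training-word occurrences
# 300 dimensions x 9 words = 2700 free parameters per weight layer,
# fit from just 9 word occurrences: wildly overparameterized, so the
# resulting vectors are essentially arbitrary.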
Try it on a more realistic dataset, with tens of thousands of unique words that each have multiple usage examples, and you'll see more intuitively correct similarity results.
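For example, gensim's downloader can fetch the ~17-million-word text8 corpus (cleaned Wikipedia text). A sketch, assuming internet access for the one-time download; the parameters here are just reasonable defaults, not tuned values:

import gensim.downloader as api
from gensim.models import Word2Vec

# text8: ~17M words of cleaned English Wikipedia, as an iterable of token lists
text8 = api.load('text8')

model = Word2Vec(
    sentences=text8,
    vector_size=100,  # far fewer dimensions than the vocabulary size
    window=5,
    min_count=5,      # discard rare words instead of keeping everything
    sg=1,
    workers=4,
)
print(model.wv.most_similar('exposure', topn=3))

With a corpus like that, every surviving word has enough contrasting contexts for the similarity rankings to become meaningful.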