I have trained Word2Vec (w2v) and k-means on my corpus, following the instructions given at this link:
https://ai.intelligentonlinetools.com/ml/k-means-clustering-example-word2vec/
What I want to do is: a. find the cluster ID for a given word; b. get the top 20 nearest words from the cluster for the given word.
I have figured out how to get the words in a given cluster. What I want is to find the words within that cluster that are closest to my given word.
Any help is appreciated.
Your linked guide is, with its given data, a bit misguided. You can't get meaningful 100-dimensional word-vectors (the gensim Word2Vec class default) from a mere 30-word corpus. The results from such a model will be nonsense, useless for clustering or other downstream steps, so any tutorial purporting to show this process with true results should be using far more data.
If you are in fact using far more data, and have succeeded in clustering words, the Word2Vec model's most_similar() function will give you the top-N (default 10) nearest words for any given input word. (Specifically, they will be returned as (word, cosine_similarity) tuples, ranked by highest cosine_similarity.)
The Word2Vec model is of course oblivious to the results of clustering, so you would have to filter those results to discard words outside the cluster of interest.
I'll assume that you have some lookup object cluster that, for cluster[word], gives you the cluster ID for a specific word. (This might be a dict, or something that does a KMeans-model predict() on the supplied vector, whatever.) And that total_words is the total number of words in your model. (For example: total_words = len(w2v_model.wv).) Then your logic should be roughly like:
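# cluster ID of the word of interest
target_cluster = cluster[target_word]
# every word in the model, ranked by cosine similarity to target_word
all_similars = w2v_model.wv.most_similar(target_word, topn=total_words)
# keep only the (word, similarity) tuples whose word is in the same cluster
in_cluster_similars = [sim for sim in all_similars
                       if cluster[sim[0]] == target_cluster]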
If you just want the top-20 results, clip to in_cluster_similars[:20].
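As a rough, self-contained sketch of the same filtering idea, the snippet below wraps the steps in two small helpers. The helper names (build_cluster_lookup, nearest_in_cluster) and the toy words, labels and similarity scores are made up for illustration. With a real model, and assuming gensim 4.x plus a scikit-learn KMeans fitted on w2v_model.wv.vectors, the words would plausibly come from w2v_model.wv.index_to_key, the labels from kmeans.labels_, and the similarity list from most_similar() as above.

```python
def build_cluster_lookup(words, labels):
    """Map each word to its cluster ID; assumes words[i] pairs with labels[i]."""
    return {word: int(label) for word, label in zip(words, labels)}


def nearest_in_cluster(similars, cluster, target_word, topn=20):
    """Return (cluster ID, top-N neighbours sharing target_word's cluster).

    `similars` is a list of (word, similarity) tuples already ranked from
    most to least similar, i.e. the shape most_similar() returns.
    """
    target_cluster = cluster[target_word]
    in_cluster = [(word, score) for word, score in similars
                  if cluster.get(word) == target_cluster]
    return target_cluster, in_cluster[:topn]


# Toy stand-ins (hypothetical data, not output from a real model).
words = ["cat", "dog", "fish", "car", "bus"]
labels = [0, 0, 0, 1, 1]
cluster = build_cluster_lookup(words, labels)

# Pretend this is the ranked most_similar() result for "cat".
similars = [("dog", 0.91), ("car", 0.55), ("fish", 0.42), ("bus", 0.10)]

cluster_id, neighbours = nearest_in_cluster(similars, cluster, "cat", topn=20)
print(cluster_id, neighbours)  # 0 [('dog', 0.91), ('fish', 0.42)]
```

Because the input list is already sorted by similarity, filtering keeps that order, so slicing to topn afterwards gives the nearest in-cluster words.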