I have a list of words in my Python program. I need to iterate through this list, find the semantically similar words, and put them into another list. I have been trying to do this using gensim with word2vec but couldn't find a proper solution. This is what I have implemented up to now. I need help on how to iterate through the list of words in the variable sentences, find the semantically similar words, and save them in another list.
import gensim, logging
import textPreprocessing, frequentWords , summarizer
from gensim.models import Word2Vec, word2vec
import numpy as np
from scipy import spatial
sentences = summarizer.sorteddict
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sentences, iter=10, min_count=5, size=300, workers=4)  # train 300-dimensional word vectors (gensim 3.x API)
If you don't care about proper clusters, you can use this code:
similar = [[item[0] for item in model.most_similar(word)[:5]] for word in words]
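For example, a slightly more explicit sketch (assuming words is your own list of words to look up, and that some of them may be missing from the trained vocabulary; the names words and neighbours are just placeholders):

# Collect up to five similar words for each input word, skipping
# words the model never saw (most_similar raises a KeyError otherwise).
similar = []
for word in words:
    if word in model.wv.vocab:
        neighbours = model.most_similar(word, topn=5)
        similar.append([w for w, _ in neighbours])
    else:
        similar.append([])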
If you really want to cluster the words, here are a few notes: if A is similar to B and B is similar to C, then all three should be in the same cluster. This means you'll have to implement some sort of graph traversal algorithm. Here's a naive and probably not very efficient algorithm that identifies clusters:
model = gensim.models.word2vec.Word2Vec(sentences, iter=10, min_count=5, size=300, workers=4)
vocab = model.wv.vocab.keys()
threshold = 0.9
clusters = {}
for word in vocab:
    # look at the five nearest neighbours of the current word
    for similar_word, similarity in model.most_similar(word)[:5]:
        if similarity > threshold:
            # merge the existing clusters of both words (if any) with the pair itself
            cluster1 = clusters.get(word, set())
            cluster2 = clusters.get(similar_word, set())
            joined = set.union(cluster1, cluster2, {word, similar_word})
            clusters[word] = joined
            clusters[similar_word] = joined
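Because many vocabulary words end up pointing at the same (or an equal) cluster set, you will probably want to deduplicate the values before using them; one possible sketch, using frozenset purely as a hashable wrapper:

# Keep each distinct cluster only once and print its members.
unique_clusters = {frozenset(members) for members in clusters.values()}
for cluster in unique_clusters:
    print(sorted(cluster))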