I have a list of words in my Python program. I need to iterate through this list, find the semantically similar words, and put them into another list. I have been trying to do this using gensim with word2vec but couldn't find a proper solution. This is what I have implemented up to now. I need help on how to iterate through the list of words in the variable sentences, find the semantically similar words, and save them in another list.
import gensim, logging
import textPreprocessing, frequentWords , summarizer
from gensim.models import Word2Vec, word2vec
import numpy as np
from scipy import spatial
sentences = summarizer.sorteddict
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sentences, iter=10, min_count=5, size=300, workers=4)  # train 300-dimensional word vectors (gensim 3.x API)
If you don't care about proper clusters, you can use this code:
similar = [[item[0] for item in model.most_similar(word)[:5]] for word in words]
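For example, a slightly more explicit sketch (assuming words is your own list of words to look up, and that some of them may be missing from the trained vocabulary; the names words and neighbours are just placeholders):

# Collect up to five similar words for each input word, skipping
# words the model never saw (most_similar raises a KeyError otherwise).
similar = []
for word in words:
    if word in model.wv.vocab:
        neighbours = model.most_similar(word, topn=5)
        similar.append([w for w, _ in neighbours])
    else:
        similar.append([])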
If you really want to cluster the words, here are a few notes: if A is similar to B and B is similar to C, then all three should be in the same cluster. This means you'll have to implement some sort of graph traversal algorithm. Here's a naive and probably not very efficient algorithm that identifies clusters:
model = gensim.models.word2vec.Word2Vec(sentences, iter=10, min_count=5, size=300, workers=4)
vocab = model.wv.vocab.keys()
threshold = 0.9
clusters = {}
for word in vocab:
    # look at the five nearest neighbours of the current word
    for similar_word, similarity in model.most_similar(word)[:5]:
        if similarity > threshold:
            # merge the existing clusters of both words (if any) with the pair itself
            cluster1 = clusters.get(word, set())
            cluster2 = clusters.get(similar_word, set())
            joined = set.union(cluster1, cluster2, {word, similar_word})
            clusters[word] = joined
            clusters[similar_word] = joined
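Because many vocabulary words end up pointing at the same (or an equal) cluster set, you will probably want to deduplicate the values before using them; one possible sketch, using frozenset purely as a hashable wrapper:

# Keep each distinct cluster only once and print its members.
unique_clusters = {frozenset(members) for members in clusters.values()}
for cluster in unique_clusters:
    print(sorted(cluster))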