I want to get a list of similar words. Since spaCy has no built-in support for this, I want to convert the spaCy model to a gensim word2vec model and get the list of similar words from that.
I have tried the method below, but it is slow.
def most_similar(word):
    by_similarity = sorted(word.vocab, key=lambda w: word.similarity(w), reverse=True)
    return [w.orth_ for w in by_similarity[:10]]
import spacy

nlp = spacy.load('en_core_web_md')
# Both calls write spaCy's own on-disk format, not a text file.
nlp.to_disk(filename)
nlp.vocab.vectors.to_disk(filename)
Neither call saves the vectors as a text file, so I cannot use the following method, which expects a GloVe-format text file as input.
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
glove_file = datapath('test_glove.txt')
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)
Step 1: Extract the words and their vectors from the spaCy model (see relevant documentation here).
Step 2: Create an instance of the class gensim.models.keyedvectors.WordEmbeddingsKeyedVectors (see relevant documentation here).
Step 3: Add the words and vectors to the WordEmbeddingsKeyedVectors instance.
import spacy
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

nlp = spacy.load('en_core_web_lg')

# Step 1: collect every key that has a vector, together with that vector.
wordList = []
vectorList = []
for key, vector in nlp.vocab.vectors.items():
    wordList.append(nlp.vocab.strings[key])
    vectorList.append(vector)

# Steps 2 and 3: create the gensim container, then add the words and vectors.
# WordEmbeddingsKeyedVectors and add() are the gensim 3.x names; in gensim 4.x
# the equivalents should be KeyedVectors and add_vectors().
kv = WordEmbeddingsKeyedVectors(nlp.vocab.vectors_length)
kv.add(wordList, vectorList)

print(kv.most_similar('software'))
# [('Software', 0.9999999403953552), ('SOFTWARE', 0.9999999403953552), ('Softwares', 0.738474428653717), ('softwares', 0.738474428653717), ('Freeware', 0.6730758547782898), ('freeware', 0.6730758547782898), ('computer', 0.67071533203125), ('Computer', 0.67071533203125), ('COMPUTER', 0.67071533203125), ('shareware', 0.6497008800506592)]
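If a text file is still wanted, as in the question, the word2vec text format is simple enough to write directly: a header line with the number of words and the vector size, then one line per word with its space-separated values. The following is a minimal sketch under that assumption. The helper name write_word2vec_text is hypothetical, and skipping keys that are empty or contain whitespace is my own choice, made because the format is whitespace-delimited.

```python
from typing import Iterable, Sequence, Tuple


def write_word2vec_text(pairs: Iterable[Tuple[str, Sequence[float]]], path: str) -> int:
    """Write (word, vector) pairs in word2vec text format.

    Hypothetical helper, not part of spaCy or gensim.
    Returns the number of words actually written.
    """
    # Assumption: keys that are empty or contain whitespace are dropped,
    # because they would corrupt a whitespace-delimited line.
    kept = [
        (word, vector)
        for word, vector in pairs
        if word and not any(ch.isspace() for ch in word)
    ]
    dim = len(kept[0][1]) if kept else 0

    with open(path, "w", encoding="utf-8") as fh:
        # Header line: "<number of words> <vector size>"
        fh.write(f"{len(kept)} {dim}\n")
        for word, vector in kept:
            if len(vector) != dim:
                raise ValueError(
                    f"vector for {word!r} has length {len(vector)}, expected {dim}"
                )
            values = " ".join(repr(float(x)) for x in vector)
            fh.write(f"{word} {values}\n")
    return len(kept)
```

With spaCy, the pairs could come from ((nlp.vocab.strings[key], vector) for key, vector in nlp.vocab.vectors.items()), and the resulting file should then load with KeyedVectors.load_word2vec_format(path, binary=False). Because the header is already written, the glove2word2vec conversion step is not needed.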