Tags: python-3.x, nlp, gensim, spacy, similarity

Is there a way to load a trained spaCy model into gensim?


I want to get a list of the words most similar to a given word. Since spaCy has no built-in support for this, I want to convert the spaCy model to a gensim word2vec model and get the list of similar words from that.

I have tried the method below, but it is time-consuming. I also tried saving the model and its vectors to disk.

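import spacy

def most_similar(word):
    # Brute force: score every lexeme in the vocabulary against `word`.
    by_similarity = sorted(word.vocab, key=lambda w: word.similarity(w), reverse=True)
    return [w.orth_ for w in by_similarity[:10]]

nlp = spacy.load('en_core_web_md')
# `filename` is a placeholder for the output path.
nlp.to_disk(filename)
nlp.vocab.vectors.to_disk(filename)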

Neither call saves the model as a text file, so I cannot convert it with the following method.

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath('test_glove.txt')
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)

Solution

  • Step 1: Extract the words and their vectors from the spaCy model (see the spaCy Vectors documentation).
  • Step 2: Create an instance of the class gensim.models.keyedvectors.WordEmbeddingsKeyedVectors (see the gensim KeyedVectors documentation).
  • Step 3: Add the words and vectors to the WordEmbeddingsKeyedVectors instance.

    import spacy
    # gensim 3.x API. In gensim 4.x this class was removed: use
    # KeyedVectors(vector_size) and kv.add_vectors(...) instead.
    from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

    nlp = spacy.load('en_core_web_lg')

    # Step 1: collect every key that has a vector, together with that vector.
    word_list = []
    vector_list = []
    for key, vector in nlp.vocab.vectors.items():
        word_list.append(nlp.vocab.strings[key])  # hash -> string
        vector_list.append(vector)

    # Step 2: create an empty KeyedVectors container of the right dimensionality.
    kv = WordEmbeddingsKeyedVectors(nlp.vocab.vectors_length)

    # Step 3: add all words and vectors in one call.
    kv.add(word_list, vector_list)

    print(kv.most_similar('software'))
    # [('Software', 0.9999999403953552), ('SOFTWARE', 0.9999999403953552),
    #  ('Softwares', 0.738474428653717), ('softwares', 0.738474428653717),
    #  ('Freeware', 0.6730758547782898), ('freeware', 0.6730758547782898),
    #  ('computer', 0.67071533203125), ('Computer', 0.67071533203125),
    #  ('COMPUTER', 0.67071533203125), ('shareware', 0.6497008800506592)]
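
If a plain-text file is still wanted, as in the question, the word2vec text format is simple enough to write by hand: a header line with the number of words and the vector dimensionality, then one line per word with its space-separated values. The following is only a minimal sketch under stated assumptions. write_word2vec_text is a hypothetical helper (not part of spaCy or gensim), and the toy pairs stand in for the (word, vector) pairs collected from nlp.vocab.vectors.items() above.

```python
import os
import tempfile


def write_word2vec_text(pairs, path):
    """Write (word, vector) pairs in the word2vec text format.

    Hypothetical helper for illustration; not part of spaCy or gensim.
    Returns the number of words written.
    """
    # Empty strings and tokens containing whitespace would break the
    # space-separated format, so they are skipped here.
    rows = [
        (word, list(vec))
        for word, vec in pairs
        if word and not any(ch.isspace() for ch in word)
    ]
    dim = len(rows[0][1]) if rows else 0
    with open(path, "w", encoding="utf-8") as f:
        # Header: "<number of words> <vector dimensionality>"
        f.write(f"{len(rows)} {dim}\n")
        for word, vec in rows:
            values = " ".join(repr(float(x)) for x in vec)
            f.write(f"{word} {values}\n")
    return len(rows)


# Toy data standing in for (assumption):
#   ((nlp.vocab.strings[key], vector)
#    for key, vector in nlp.vocab.vectors.items())
toy_pairs = [
    ("software", [0.1, 0.2]),
    ("free ware", [0.3, 0.4]),  # contains a space, so it is skipped
    ("computer", [0.5, 0.6]),
]
out_path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
written = write_word2vec_text(toy_pairs, out_path)
print(written, "words written to", out_path)
```

A file written this way should be loadable with gensim's KeyedVectors.load_word2vec_format(out_path, binary=False), which avoids the glove2word2vec conversion step. For a large model, building the KeyedVectors object in memory as shown above is likely faster than going through a text file.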