Tags: spacy, named-entity-recognition

Pretrained vectors not loading in spaCy


I am training a custom NER model from scratch, starting from spacy.blank("en") and adding custom word vectors to it. The vectors are loaded as follows:

from gensim.models import KeyedVectors

# load only the first 300,000 vectors from the binary word2vec file
med_vec = KeyedVectors.load_word2vec_format('./wikipedia-pubmed-and-PMC-w2v.bin', binary=True, limit=300000)

I then add them to the blank model in this snippet:

import random
import spacy

def main(model=None, n_iter=3, output_dir=None):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model) # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        nlp.vocab.reset_vectors(width=200)
        for idx in range(len(med_vec.index2word)):
            word = med_vec.index2word[idx]
            vector = med_vec.vectors[idx]
            nlp.vocab.set_vector(word, vector)
        for key, vector in nlp.vocab.vectors.items():
            nlp.vocab.strings.add(nlp.vocab.strings[key])
        nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
        print("Created blank 'en' model")
    # ... code for training the NER ...

I then save this model.

When I try to load the model with nlp = spacy.load("./NDLA/vectorModel0"), I get the following error:


~\AppData\Local\Continuum\anaconda3\lib\site-packages\thinc\neural\_classes\static_vectors.py in __init__(self, lang, nO, drop_factor, column)
     47         if self.nM == 0:
     48             raise ValueError(
---> 49                 "Cannot create vectors table with dimension 0.\n"
     50                 "If you're using pre-trained vectors, are the vectors loaded?"
     51             )

ValueError: Cannot create vectors table with dimension 0.
If you're using pre-trained vectors, are the vectors loaded?

I also get this warning:

 UserWarning: [W019] Changing vectors name from spacy_pretrained_vectors to spacy_pretrained_vectors_336876, to avoid clash with previously loaded vectors. See Issue #3853.
  "__main__", mod_spec)

The vocab directory in the saved model contains a vectors file of about 270 MB, so I know it is not empty. What is causing this error?


Solution

  • You could try passing all the vectors at once instead of setting them one by one in a for loop.

    nlp.vocab.vectors = spacy.vocab.Vectors(data=med_vec.vectors, keys=med_vec.index2word)

    So your else branch would become:

    else:
        nlp = spacy.blank("en")  # create blank Language class
        nlp.vocab.reset_vectors(width=200)
        # build the whole table in one call; index2word is in the same order as the rows of med_vec.vectors
        nlp.vocab.vectors = spacy.vocab.Vectors(data=med_vec.vectors, keys=med_vec.index2word)
        nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
        print("Created blank 'en' model")
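Before handing the whole table over in one call, it may be worth confirming that the keys and the rows actually line up and that every row has the width you expect (200 in the question). The following is only a minimal sketch in plain Python, so it runs without gensim or spaCy. `check_vector_table` and its parameters are hypothetical names made up for illustration, not part of either library, and the toy data stands in for the real vectors.

```python
def check_vector_table(words, rows, width):
    """Return a list of problems found; an empty list means the table looks consistent.

    words: sequence of keys (hypothetical stand-in for med_vec.index2word)
    rows:  sequence of vectors, one per key (stand-in for med_vec.vectors)
    width: the vector width you expect every row to have
    """
    problems = []
    # every key needs exactly one row
    if len(words) != len(rows):
        problems.append("%d keys but %d rows" % (len(words), len(rows)))
    # a repeated key would map two rows to the same entry
    if len(set(words)) != len(words):
        problems.append("duplicate keys")
    # report only the first row with an unexpected width
    for i, row in enumerate(rows):
        if len(row) != width:
            problems.append("row %d has width %d, expected %d" % (i, len(row), width))
            break
    return problems


# toy data, assumed purely for illustration
words = ["aspirin", "ibuprofen"]
rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_vector_table(words, rows, width=3))  # prints []
```

With the objects from the question, the call would look like check_vector_table(med_vec.index2word, med_vec.vectors, 200). An empty list only tells you the input table is consistent; it does not by itself explain the error raised on load.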