nlp spacy glove

Using glove.6B.100d.txt embeddings in spaCy gives zero lex.rank


I am trying to load GloVe 100d embeddings into a spaCy nlp pipeline.

I create the vocabulary in spacy format as follows:

python -m spacy init-model en spacy.glove.model --vectors-loc glove.6B.100d.txt

glove.6B.100d.txt was converted to word2vec format by adding "400000 100" as the first line.
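For reference, the header step described above can be sketched in Python. This is a minimal sketch, not the exact procedure used in the question: the function name add_word2vec_header and the output filename are hypothetical, and it assumes every line of the GloVe file is a token followed by space-separated float values.

```python
def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file, prepending a "<rows> <dims>" header line.

    Hypothetical helper for illustration; assumes each line is
    "<token> <v1> <v2> ... <vN>" separated by single spaces.
    """
    rows = 0
    dims = 0
    # First pass: count the rows and read the dimensionality off the first line.
    with open(glove_path, encoding="utf-8", newline="\n") as src:
        for line in src:
            if rows == 0:
                dims = len(line.rstrip("\n").split(" ")) - 1
            rows += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8", newline="\n") as src, \
            open(out_path, "w", encoding="utf-8", newline="\n") as dst:
        dst.write(f"{rows} {dims}\n")
        for line in src:
            dst.write(line)
    return rows, dims
```

Calling add_word2vec_header("glove.6B.100d.txt", "glove.6B.100d.w2v.txt") should write a header of "400000 100" for this file, matching the line that was added by hand; the resulting file is what would then be passed to --vectors-loc.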

Now spacy.glove.model/vocab has the following files:
5468549  key2row
38430528  lexemes.bin
5485216  strings.json
160000128  vectors

In the code:

import spacy 
nlp = spacy.load("en_core_web_md")

from spacy.vocab import Vocab
vocab = Vocab().from_disk('./spacy.glove.model/vocab')

nlp.vocab = vocab

print(len(nlp.vocab.strings))
print(nlp.vocab.vectors.shape)

gives 407174 (400000, 100)

However the problem is that:

V=nlp.vocab
max_rank = max(lex.rank for lex in V if lex.has_vector)
print(max_rank) 

gives 0

I just want to use the 100d glove embeddings within spacy in combination with "tagger", "parser", "ner" models from en_core_web_md.

Does anyone know how to go about doing this correctly (is this possible)?


Solution

  • The tagger/parser/ner models are trained with the included word vectors as features, so if you replace them with different vectors you are going to break all those components.

    You can use new vectors to train a new model, but replacing the vectors in a model with trained components is not going to work well. The tagger/parser/ner components will most likely provide nonsense results.

    If you want 100d vectors instead of 300d vectors to save space, you can resize the vectors, which will truncate each entry to its first 100 dimensions. The performance will go down a bit as a result.

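    import spacy

    nlp = spacy.load("en_core_web_md")
    # Shape expected for the included vectors: 20000 rows, 300 dimensions
    assert nlp.vocab.vectors.shape == (20000, 300)
    # Keep every row, but only the first 100 dimensions of each
    nlp.vocab.vectors.resize((20000, 100))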