Tags: python, keras, neural-networks, spacy, lemmatization

spaCy - number of lemmas


I'm using spaCy to replace every word in a sentence with a number/code; afterwards I use the resulting vector as input to a recurrent neural network.

import spacy

sp = spacy.load('en_core_web_sm')
text = "basing based base"   # renamed so the built-in str is not shadowed
sentence = sp(text)
for w in sentence:
    print(w.text, w.lemma)

In the first layer of the neural network in Keras, the Embedding layer, I have to know the maximum number of words in the lookup table. Does anyone know this number? Thank you.


Solution

  • The lemma indices are in fact hashes, so there is no contiguous range of indices from 0 to the number of vocabulary entries. Even sp.vocab.strings["randomnonwordstring#"] gives you an integer (see the first sketch below).

    For the entry "base", the ID in sp.vocab is 4715552063986449646 (note that the vocab is shared between surface forms and lemmas). You could never fit that many embedding rows in memory.

    The correct solution is to create a dictionary that maps words to indices based on what you have in your training data, as in the second sketch below.
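
    A minimal sketch of the hash behaviour described above, assuming spaCy and the en_core_web_sm model are installed (the exact hash values you see may differ between spaCy versions):

    import spacy

    sp = spacy.load("en_core_web_sm")

    # Lemma IDs are 64-bit hashes of the lemma string, not row indices
    for token in sp("basing based base"):
        print(token.text, token.lemma, token.lemma_)

    # Even a string that is not a real word maps to some integer hash
    print(sp.vocab.strings["randomnonwordstring#"])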
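
    And a sketch of the dictionary-based approach, assuming a TensorFlow/Keras setup; train_sentences, word2idx and encode are illustrative names introduced here, not part of the original answer:

    import spacy
    from tensorflow import keras

    sp = spacy.load("en_core_web_sm")

    # Hypothetical training corpus; replace with your own data
    train_sentences = ["basing based base", "the cats were sitting"]

    # Map each lemma seen in the training data to a small integer.
    # Index 0 is reserved for padding, index 1 for unknown words.
    word2idx = {"<pad>": 0, "<unk>": 1}
    for sent in train_sentences:
        for token in sp(sent):
            word2idx.setdefault(token.lemma_, len(word2idx))

    def encode(sentence, maxlen=10):
        # Turn a sentence into a fixed-length list of lemma indices
        ids = [word2idx.get(tok.lemma_, word2idx["<unk>"]) for tok in sp(sentence)][:maxlen]
        return ids + [word2idx["<pad>"]] * (maxlen - len(ids))

    # The Embedding layer now only needs len(word2idx) rows
    model = keras.Sequential([
        keras.layers.Embedding(input_dim=len(word2idx), output_dim=32, mask_zero=True),
        keras.layers.SimpleRNN(32),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

    print("vocabulary size:", len(word2idx))
    print(encode("based cats"))

    The number you pass to the Embedding layer as input_dim is then just len(word2idx), which grows with your training vocabulary rather than with spaCy's hash space.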