Tags: python, keras, neural-networks, spacy, lemmatization

spaCy - number of lemmas


I'm using spaCy to replace every word in a sentence with a number/code; afterwards I use the resulting vector as input to a recurrent neural network.

import spacy

sp = spacy.load('en_core_web_sm')
text = "basing based base"   # renamed so the built-in str is not shadowed
sentence = sp(text)
for w in sentence:
    print(w.text, w.lemma)

In the first layer of the neural network in Keras, the Embedding layer, I have to know the maximum number of words in the lookup table. Does anyone know this number? Thank you.


Solution

  • The lemma indices are in fact hashes, so there is no contiguous range of indices from 0 to the number of vocabulary entries. Even sp.vocab.strings["randomnonwordstring#"] gives you an integer (see the first sketch below).

    For the entry "base", the ID in sp.vocab is 4715552063986449646 (note that the vocab is shared between surface forms and lemmas). You could never fit that many embedding rows in memory.

    The correct solution is to create a dictionary that maps words to indices based on what you have in your training data, as in the second sketch below.
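
    A minimal sketch of the hash behaviour described above, assuming spaCy and the en_core_web_sm model are installed (the exact hash values you see may differ between spaCy versions):

    import spacy

    sp = spacy.load("en_core_web_sm")

    # Lemma IDs are 64-bit hashes of the lemma string, not row indices
    for token in sp("basing based base"):
        print(token.text, token.lemma, token.lemma_)

    # Even a string that is not a real word maps to some integer hash
    print(sp.vocab.strings["randomnonwordstring#"])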
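
    And a sketch of the dictionary-based approach, assuming a TensorFlow/Keras setup; train_sentences, word2idx and encode are illustrative names introduced here, not part of the original answer:

    import spacy
    from tensorflow import keras

    sp = spacy.load("en_core_web_sm")

    # Hypothetical training corpus; replace with your own data
    train_sentences = ["basing based base", "the cats were sitting"]

    # Map each lemma seen in the training data to a small integer.
    # Index 0 is reserved for padding, index 1 for unknown words.
    word2idx = {"<pad>": 0, "<unk>": 1}
    for sent in train_sentences:
        for token in sp(sent):
            word2idx.setdefault(token.lemma_, len(word2idx))

    def encode(sentence, maxlen=10):
        # Turn a sentence into a fixed-length list of lemma indices
        ids = [word2idx.get(tok.lemma_, word2idx["<unk>"]) for tok in sp(sentence)][:maxlen]
        return ids + [word2idx["<pad>"]] * (maxlen - len(ids))

    # The Embedding layer now only needs len(word2idx) rows
    model = keras.Sequential([
        keras.layers.Embedding(input_dim=len(word2idx), output_dim=32, mask_zero=True),
        keras.layers.SimpleRNN(32),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

    print("vocabulary size:", len(word2idx))
    print(encode("based cats"))

    The number you pass to the Embedding layer as input_dim is then just len(word2idx), which grows with your training vocabulary rather than with spaCy's hash space.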