I'm using spaCy to replace every word in a sentence with a number/code; afterwards I want to use that vector as the input of a recurrent neural network.
import spacy

sp = spacy.load('en_core_web_sm')
text = "basing based base"  # renamed to avoid shadowing the built-in str
sentence = sp(text)
for w in sentence:
    print(w.text, w.lemma)
For the first layer of the neural network in Keras, the Embedding layer, I need to know the maximum number of words in the lookup table. Does anyone know this number? Thank you.
The lemma indices are in fact hashes, so there is no contiguous range of indices from 0 to the number of vocabulary entries. Even sp.vocab.strings["randomnonwordstring#"] gives you an integer.

For the entry "base", the ID in sp.vocab is 4715552063986449646 (note that the vocab is shared between surface forms and lemmas). You could never fit that many embeddings in memory.
The correct solution is to create a dictionary that maps words to indices based on what appears in your training data.
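A minimal sketch of such a dictionary (the helper names and the <PAD>/<UNK> conventions are my own, not from spaCy or Keras):

```python
def build_vocab(sentences):
    # Reserve 0 for padding and 1 for out-of-vocabulary words,
    # then assign each new training word the next free index.
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    # Words unseen at training time map to the <UNK> index.
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]

train = ["basing based base", "base runner"]
vocab = build_vocab(train)
print(vocab)
# {'<PAD>': 0, '<UNK>': 1, 'basing': 2, 'based': 3, 'base': 4, 'runner': 5}
print(encode("base unknownword", vocab))
# [4, 1]
```

With this scheme, the input_dim of the Keras Embedding layer is simply len(vocab), a few thousand entries instead of a 64-bit hash space.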