I am trying to find the vocabulary size of the large English model, i.e. en_core_web_lg, and I find three different sources of information:

- spaCy's docs: 685k keys, 685k unique vectors
- `nlp.vocab.__len__()`: 1340242 (number of lexemes)
- `len(vocab.strings)`: 1476045
What is the difference between the three? I have not been able to find the answer in the docs.
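For reference, this is roughly how I'm reading those numbers off the loaded model (assuming `en_core_web_lg` is installed; the exact counts will depend on the model version):

```python
import spacy

nlp = spacy.load("en_core_web_lg")

print(nlp.vocab.__len__())     # number of lexemes, e.g. 1340242
print(len(nlp.vocab.strings))  # number of entries in the string store, e.g. 1476045
```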
The most useful numbers are the ones related to word vectors. `nlp.vocab.vectors.n_keys` tells you how many tokens have word vectors and `len(nlp.vocab.vectors)` tells you how many unique word vectors there are (multiple tokens can refer to the same word vector in `md` models).
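A quick way to check both counts (a sketch, assuming `en_core_web_lg` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_lg")

print(nlp.vocab.vectors.n_keys)  # how many keys (token strings) have a word vector
print(len(nlp.vocab.vectors))    # how many unique vector rows are stored

# In lg models these two are typically equal; in md models many keys share a row,
# so n_keys is larger than the number of unique vectors.
```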
`len(vocab)` is the number of cached lexemes. In `md` and `lg` models most of those 1340242 lexemes have some precalculated features (like `Token.prob`), but there can be additional lexemes in this cache without precalculated features, since more entries are added as you process texts.
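As a small illustration of that cache growing (a sketch; the exact counts depend on the model version and on what you process):

```python
import spacy

nlp = spacy.load("en_core_web_lg")

before = len(nlp.vocab)
nlp("A sentence containing a madeupwordxyz.")  # unseen strings get new lexeme entries
after = len(nlp.vocab)

print(before, after)  # `after` can be larger than `before`
```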
`len(vocab.strings)` is the number of strings related to both tokens and annotations (like `nsubj` or `NOUN`), so it's not a particularly useful number. All strings used anywhere in training or processing are stored here so that the internal integer hashes can be converted back to strings when needed.
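A small sketch of what ends up in the string store, including the hash-to-string round trip (again assuming `en_core_web_lg`):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I love coffee")

# Annotation labels live alongside token strings:
print("nsubj" in nlp.vocab.strings)  # True
print("NOUN" in nlp.vocab.strings)   # True

# Every stored string maps to a stable integer hash and back:
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash, nlp.vocab.strings[coffee_hash])
```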