
How to find the vocabulary size of a spaCy model?


I am trying to find the vocabulary size of the large English model, i.e. en_core_web_lg, and I find three different sources of information:

  • spaCy's docs: 685k keys, 685k unique vectors

  • nlp.vocab.__len__(): 1340242 # (number of lexemes)

  • len(vocab.strings): 1476045

What is the difference between the three? I have not been able to find the answer in the docs.
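For reference, the second and third numbers come from something along these lines (assuming en_core_web_lg is installed locally):

    import spacy

    nlp = spacy.load("en_core_web_lg")

    print(len(nlp.vocab))          # 1340242 - number of lexemes
    print(len(nlp.vocab.strings))  # 1476045 - number of stored strings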


Solution

  • The most useful numbers are the ones related to word vectors. nlp.vocab.vectors.n_keys tells you how many tokens have word vectors and len(nlp.vocab.vectors) tells you how many unique word vectors there are (multiple tokens can refer to the same word vector in md models).

    len(vocab) is the number of cached lexemes. In the md and lg models, most of those 1340242 lexemes have some precalculated features (like Token.prob), but the cache can also contain lexemes without precalculated features, since new entries are added as you process texts.

    len(vocab.strings) is the number of strings related to both tokens and annotations (like nsubj or NOUN), so it's not a particularly useful number. All strings used anywhere in training or processing are stored here so that the internal integer hashes can be converted back to strings when needed. The snippet below shows all three counts side by side.
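A minimal sketch to inspect all three counts in one place (it assumes en_core_web_lg is installed; the exact numbers depend on the model version):

    import spacy

    nlp = spacy.load("en_core_web_lg")

    # Word vectors: entries with a vector vs. unique vector rows
    print(nlp.vocab.vectors.n_keys)   # tokens that have a word vector
    print(len(nlp.vocab.vectors))     # unique vector rows in the table

    # Cached lexemes: the cache can grow as new strings are processed
    before = len(nlp.vocab)
    doc = nlp("A previously unseen token like zxqvbn adds a lexeme.")
    print(before, len(nlp.vocab))     # the second number may be larger

    # StringStore: strings for tokens and annotation labels alike
    print(len(nlp.vocab.strings))
    print("nsubj" in nlp.vocab.strings)  # True - labels are stored here too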