Tags: python, memory, spacy, named-entity-recognition

Python: spaCy NER and memory consumption


I use spaCy for named entity recognition, with my own model trained on top of en_core_web_md. The model is 223 MB on disk, but it uses about 800 MB once loaded into memory. For NER purposes, is it possible to skip loading everything (lexemes.bin, strings.json, key2row) and load only the vectors and the model itself (which weigh 4 MB and 24 MB respectively), so that it consumes much less memory? Or is all of it necessary for NER?
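For reference, a minimal way to reproduce this kind of measurement with psutil (the path my_ner_model is a placeholder for the custom model, not a name from spaCy):

```python
import os

import psutil
import spacy

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss

# "my_ner_model" is a placeholder for the custom model trained on en_core_web_md
nlp = spacy.load("my_ner_model")

rss_after = process.memory_info().rss
print(f"Loading the model added roughly {(rss_after - rss_before) / 1024 ** 2:.0f} MB of RSS")
```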


Solution

  • For spaCy v2.2, it is necessary to load everything. There is one minor bug that affects key2row in the md models: to improve the size and loading time of key2row in versions v2.2.0–v2.2.5, see https://stackoverflow.com/a/60541041/461847.

    The key2row bug is fixed in v2.2.4 if you're training a model from scratch with your own custom vectors, but the provided v2.2 md models will still have this issue.

    Planned for v2.3: removal of lexemes.bin, with lexemes created only on demand. With these changes, the md models will be about 50% smaller on disk and initial loading will be about 50% faster. The English md model looks like it's about 300 MB smaller in memory when first loaded, but memory usage will increase a bit during use as it builds a lexeme cache (see the sketch below). See: https://github.com/explosion/spaCy/pull/5238
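    A small sketch of how that on-demand behavior can be observed under v2.3+, assuming en_core_web_md is installed:

```python
import spacy

nlp = spacy.load("en_core_web_md")

# Under v2.3+, lexemes are no longer preloaded from lexemes.bin,
# so the vocab starts small and grows as text is processed.
print("lexemes before:", len(nlp.vocab))

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Larger now: lexemes for the processed tokens were created on demand
# and added to the lexeme cache.
print("lexemes after:", len(nlp.vocab))
```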