Tags: nlp, spacy, fastapi, uvicorn

MemoryError with FastAPI and SpaCy


I am running a FastAPI (v0.63.0) web app that uses SpaCy (v3.0.5) for tokenizing input texts. After the web service has been running for a while, its total memory usage grows too large, SpaCy starts throwing MemoryErrors, and the web service responds with 500 errors.

XXX.internal web[17552]: MemoryError:
XXX.internal web[17552]: INFO:     xxx - "POST /page HTTP/1.1" 500 Internal Server Error
XXX.internal web[17552]: ERROR:    Exception in ASGI application
[...]
XXX.internal web[17552]: Traceback (most recent call last):
XXX.internal web[17552]: File "spacy/tokens/token.pyx", line 263, in spacy.tokens.token.Token.text.__get__                                                                                                                                        
XXX.internal web[17552]: File "spacy/tokens/token.pyx", line 806, in spacy.tokens.token.Token.orth_.__get__                                                                                                                                       
XXX.internal web[17552]: File "spacy/strings.pyx", line 132, in spacy.strings.StringStore.__getitem__
XXX.internal web[17552]: KeyError: "[E018] Can't retrieve string for hash '10429668501569482890'. This usually refers to an issue with the `Vocab` or `StringStore`." 

Here's the relevant part of my main.py:

@app.post(f"/page", response_model=PageResponse)
async def classify(request: PageRequest):
    try:
        preprocessed = await preprocessor.preprocess(request.text)
    [...]

The preprocessor object is an instance of a class whose preprocess method calls the SpaCy tokenizer:

class SpacyTokenizer(Tokenizer):
    def __init__(self, nlp: spacy.Language):
        self._nlp = spacy.load("en_core_web_sm")

        for component in self._nlp.pipe_names:
            # we only need tokenization
            self._nlp.remove_pipe(component)

    def tokenize(self, text: str) -> Iterable[str]:
        if len(text) >= self._nlp.max_length:
            raise ValueError(f"Text too long: {len(text)} characters.")

        try:
            doc = self._nlp(text)
            return islice(
                (token.text for token in doc), settings.SPACY_MAX_TOKENS
            )
        except MemoryError:
            raise ValueError(f"Text too long: {len(text)} characters.")

As you can see in the code, I have tried to prevent the issue by capping the number of tokens that are generated and by catching the MemoryError. Neither seems to have any effect, though (I do understand that catching a MemoryError is conceptually unreliable anyway).

I have observed that the worker processes on the server machine keep using more memory over time:

17552 webapp    20   0 2173336   1,6g   7984 S  4,7 79,9  33:29.04 uvicorn                                                                                                                                                                                                        

When the process is first started, the uvicorn process takes ~700 MB instead of the 1.6 GB shown above.

From the error messages, I suppose it is quite clear that the SpaCy tokenizer is the main culprit. However, I would expect the worker thread to release its memory once a request has been processed, so FastAPI or Uvicorn also seem to be plausible root causes.

My main question, though, is: where and how can I debug this?

A similar discussion about an old SpaCy issue suggests that reloading the nlp object occasionally could be a workaround. I am not sure, though, whether that still applies to more recent SpaCy versions, nor how it should be tackled.

On the other hand, are there FastAPI or Uvicorn options that could take care of releasing the memory of their worker threads?


Solution

  • The SpaCy tokenizer interns every string it encounters in the model's StringStore (the Vocab's string table referenced in the error message above). Consequently, each previously unseen token increases the size of that table. Over time, new tokens inevitably keep appearing (although at a decreasing rate, following Zipf's law), so after large numbers of texts have been processed the string store eventually outgrows the available memory. With a large amount of available memory, this can of course be delayed for a very long time. The first snippet below shows how to observe this growth.

    The solution I have chosen is to store the SpaCy model in a TTLCache and to reload it every hour, which empties the string store. This adds some extra computational cost for reloading the SpaCy model, but that is almost negligible. The second snippet below sketches the approach.
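
    Observing the growth of the string store (a minimal sketch; the "unseen" example token is an arbitrary choice):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    before = len(nlp.vocab.strings)
    nlp("a text containing a previously unseen token such as zyzzogeton")
    print(len(nlp.vocab.strings) - before)  # > 0: new strings were interned by the tokenizer

    The periodic reload can be sketched roughly as follows. This assumes the cachetools package; the helper name get_nlp and the one-hour TTL are illustrative choices, not the exact service code:

    import spacy
    from cachetools import TTLCache, cached

    # Illustrative helper: at most one cached model, expired after 3600 seconds.
    # After expiry, the next call reloads the model, which starts with a fresh
    # StringStore, so the accumulated strings of the old model can be freed.
    @cached(cache=TTLCache(maxsize=1, ttl=3600))
    def get_nlp() -> spacy.Language:
        nlp = spacy.load("en_core_web_sm")
        for component in nlp.pipe_names:
            # we only need tokenization, as in the tokenizer class above
            nlp.remove_pipe(component)
        return nlp

    def tokenize(text: str) -> list:
        nlp = get_nlp()  # reloaded at most once per hour
        return [token.text for token in nlp(text)]

    The extra cost is at most one model load per hour; all requests in between reuse the cached model.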