Tags: python, pytorch, fastapi, sentence-transformers

GPU out of memory when FastAPI is used with SentenceTransformers inference


I'm currently using FastAPI with Gunicorn/Uvicorn as my server engine. Inside a FastAPI GET endpoint I'm running a SentenceTransformer model on the GPU:

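# ...

import uvicorn
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

# Runs at import time, so each process that imports this module loads its own copy onto the GPU.
encoding_model = SentenceTransformer(model_name, device='cuda')

# ...
app = FastAPI()

@app.get("/search/")
def encode(query: str):
    # Convert the embedding to a plain list so it can be serialized as JSON.
    return encoding_model.encode(query).tolist()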

# ...

def main():
    uvicorn.run(app, host="127.0.0.1", port=8000)


if __name__ == "__main__":
    main()

I'm using the following config for Gunicorn:

TIMEOUT 0
GRACEFUL_TIMEOUT 120
KEEP_ALIVE 5
WORKERS 10

Uvicorn uses all default settings and is started in the Docker container in the usual way:

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

So, inside the Docker container I have 10 Gunicorn workers, each using the GPU.

The problem is the following:

After some load my API fails with the following message:

torch.cuda.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 734.00 MiB 
(GPU 0; 15.74 GiB total capacity; 
11.44 GiB already allocated; 
189.56 MiB free; 
11.47 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Solution

  • The problem was that there were 10 replicas of my transformer model on the GPU, one per worker, as @Chris mentioned above. My solution was to use Celery as an RPC manager (RabbitMQ + Redis backend setup) and a separate container for GPU-bound computations, so now there is only one instance of my model on the GPU and no competition for GPU memory between models in different processes.
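
The single-owner idea behind that setup can be sketched with the standard library alone. This is not the Celery configuration from the answer: it swaps in an in-process `queue.Queue` and one worker thread to show the pattern of one model instance serving many callers. `FakeModel` and `SingleModelWorker` are hypothetical names, and `FakeModel` is only a stand-in for `SentenceTransformer` so the sketch runs without a GPU.

```python
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for SentenceTransformer (no GPU needed)."""

    instances = 0  # counts how many times a model was constructed

    def __init__(self):
        FakeModel.instances += 1

    def encode(self, text):
        # Returns a dummy one-element "embedding" based on text length.
        return [float(len(text))]


class SingleModelWorker:
    """Owns the only model instance; callers submit jobs through a queue."""

    def __init__(self, model_factory):
        self._jobs = queue.Queue()
        self._factory = model_factory
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        model = self._factory()  # constructed once, inside the worker thread
        while True:
            item = self._jobs.get()
            if item is None:  # sentinel: shut down
                break
            text, future = item
            try:
                future.set_result(model.encode(text))
            except Exception as exc:
                future.set_exception(exc)

    def encode(self, text, timeout=None):
        # Blocks the caller until the worker has processed this job.
        future = Future()
        self._jobs.put((text, future))
        return future.result(timeout=timeout)

    def close(self):
        self._jobs.put(None)
        self._thread.join()


if __name__ == "__main__":
    worker = SingleModelWorker(FakeModel)
    # Ten concurrent callers, still only one model instance.
    with ThreadPoolExecutor(max_workers=10) as pool:
        print(list(pool.map(worker.encode, ["a", "bb", "ccc"])))
    print(FakeModel.instances)
    worker.close()
```

A thread and an in-memory queue only help inside one process. With 10 Gunicorn worker processes, each would still build its own model, which is why the answer moves the model into a separate container and uses an external broker to carry the requests.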