I'm currently using FastAPI with Gunicorn/Uvicorn as my server engine. Inside a FastAPI GET endpoint I'm using a SentenceTransformer model on the GPU:
# ...
from fastapi import FastAPI
import uvicorn
from sentence_transformers import SentenceTransformer

# model_name is defined elsewhere; the model is loaded once per worker process
encoding_model = SentenceTransformer(model_name, device='cuda')
# ...
app = FastAPI()

@app.get("/search/")
def encode(query: str):
    return encoding_model.encode(query).tolist()
# ...

def main():
    uvicorn.run(app, host="127.0.0.1", port=8000)

if __name__ == "__main__":
    main()
I'm using the following config for Gunicorn:
TIMEOUT 0
GRACEFUL_TIMEOUT 120
KEEP_ALIVE 5
WORKERS 10
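Assuming these map one-to-one onto Gunicorn's settings with Uvicorn workers, the effective invocation would be roughly the following (this command is my reconstruction, not the actual entrypoint):
gunicorn app.main:app \
    --workers 10 \
    --worker-class uvicorn.workers.UvicornWorker \
    --timeout 0 \
    --graceful-timeout 120 \
    --keep-alive 5 \
    --bind 0.0.0.0:8000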
Uvicorn has all default settings and is started in the Docker container in the usual way:
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
So, inside the Docker container I have 10 Gunicorn workers, each using the GPU.
The problem: after some load, my API fails with the following message:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 734.00 MiB
(GPU 0; 15.74 GiB total capacity;
11.44 GiB already allocated;
189.56 MiB free;
11.47 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The problem was that there were 10 replicas of my transformer model on the GPU, as @Chris mentioned above: each Gunicorn worker is a separate process and loads its own copy of the model, so with 10 workers that adds up to roughly the 11.44 GiB shown as already allocated.
My solution was to use Celery as an RPC manager (RabbitMQ as the broker, Redis as the result backend) and a separate container for the GPU-bound computation, so now there is only one instance of the model on the GPU and no contention between models in different worker processes.
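A minimal sketch of that layout, assuming placeholder names (the encoder_worker module, broker/backend URLs, and the model name are illustrative, not my exact setup):
# encoder_worker.py -- runs in the GPU container and loads the model exactly once
from celery import Celery
from sentence_transformers import SentenceTransformer

celery_app = Celery(
    "encoder",
    broker="amqp://guest:guest@rabbitmq:5672//",  # RabbitMQ broker (placeholder URL)
    backend="redis://redis:6379/0",               # Redis result backend (placeholder URL)
)

# One model instance per worker process; start the worker with
# `celery -A encoder_worker worker --pool=solo --concurrency=1`
# so only a single copy ever lands on the GPU.
encoding_model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

@celery_app.task
def encode_text(query: str) -> list:
    return encoding_model.encode(query).tolist()


# app/main.py -- CPU-only FastAPI container that just dispatches RPC calls
from celery import Celery
from fastapi import FastAPI

celery_app = Celery(
    "encoder",
    broker="amqp://guest:guest@rabbitmq:5672//",
    backend="redis://redis:6379/0",
)
app = FastAPI()

@app.get("/search/")
def search(query: str):
    # Send the task by name so this container never imports the model code,
    # then block until the GPU worker returns the embedding (timeout is arbitrary).
    result = celery_app.send_task("encoder_worker.encode_text", args=[query])
    return result.get(timeout=30)
With this split the FastAPI container can keep running many Gunicorn workers, since none of them ever touch the GPU.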