With my team, we're currently building an API with FastAPI, and we're really struggling to get good performance out of it once it's deployed to Kubernetes. We use async calls as much as possible, but we're sitting at ~8 RPS per pod to stay under our SLA of 200ms at P99.
For resources, we assign the following:
resources:
  limits:
    cpu: 1
    memory: 800Mi
  requests:
    cpu: 600m
    memory: 100Mi
Surprisingly, these performance drops don't occur when we run load tests against the API locally in a Docker container: there we easily get ~200 RPS on a single container with 120ms latency at P99...
Would anyone have an idea of what could be going wrong here, and where I could start looking to find the bottleneck?
It turned out that our performance issues were caused by running only uvicorn, without gunicorn, even though that is exactly what FastAPI's author recommends in his documentation. The Uvicorn authors recommend the other way round in their docs, i.e. putting uvicorn behind gunicorn. We followed their advice and our performance issues were gone.
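If it helps anyone, here is a minimal sketch of what that setup looks like as a gunicorn config file; the module path app.main:app, the worker count and the port are placeholders, not our real values:

    # gunicorn.conf.py -- minimal sketch, values are placeholders
    # Start with: gunicorn -c gunicorn.conf.py app.main:app

    # Use uvicorn's worker class so gunicorn can serve the ASGI app
    worker_class = "uvicorn.workers.UvicornWorker"

    # A couple of workers per pod; tune this against your CPU limit
    workers = 2

    # Listen on all interfaces inside the container
    bind = "0.0.0.0:8000"

    # Optionally recycle workers to keep memory growth in check
    max_requests = 1000
    max_requests_jitter = 100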
As suggested by people in this thread, raising the CPU request in our PodSpec was also part of the solution.
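The shape of that change was simply moving requests.cpu closer to the limit; the value below is illustrative, not our exact production number:

    resources:
      limits:
        cpu: 1
        memory: 800Mi
      requests:
        cpu: 1        # raised from 600m so the pod is scheduled with a full core available
        memory: 100Mi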
EDIT: In the end, we discovered that the performance issues were actually caused by our OpenTelemetry instrumentation of FastAPI through the opentelemetry-instrument CLI. It was adding a lot of overhead and making blocking calls inside FastAPI's async event loop. Performance is now super stable with either gunicorn or uvicorn. We are still running gunicorn with multiple workers, but we plan to move back to single-process uvicorn and scale more dynamically instead.
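For anyone hitting the same thing: one alternative to the CLI wrapper (not necessarily what we ended up shipping) is to wire up the SDK explicitly and instrument only the FastAPI app, keeping span export on a background thread via the batch processor. A rough sketch, with the service name, collector endpoint and route as placeholders:

    # Sketch: explicit OpenTelemetry setup instead of the opentelemetry-instrument CLI.
    # "my-api", the collector endpoint and the /health route are placeholders.
    from fastapi import FastAPI
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

    app = FastAPI()

    # Configure the tracer provider once at startup.
    provider = TracerProvider(resource=Resource.create({"service.name": "my-api"}))

    # BatchSpanProcessor exports spans from a background thread,
    # so request handlers don't wait on the collector.
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
        )
    )
    trace.set_tracer_provider(provider)

    # Instrument just this app rather than auto-instrumenting the whole process.
    FastAPIInstrumentor.instrument_app(app)

    @app.get("/health")
    async def health() -> dict:
        return {"status": "ok"}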