amazon-web-services kubernetes load-balancing

Why is 1 pod replica slower than the other 5?


I have a multi-node Kubernetes cluster with 6 pods (replicas, 8Gi, 4 CPU cores) running on different nodes in an Auto Scaling Group. The pods run an application that serves a REST API and is connected to Redis.

Of all the requests going through the ALB, which is configured via an Ingress, some are painfully slower than others.

When I sent requests directly to each pod IP, I found that 1 pod was much slower (almost 5 times as slow) than the other 5, which drags the overall response time up drastically.
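The per-pod comparison described above could be scripted roughly as follows. This is only a sketch: the pod IPs, the port `8080`, and the path `/health` are placeholders rather than values from the question, and the 3x threshold is an arbitrary choice.

```python
import statistics
import time
import urllib.request


def time_request(url, timeout=10.0):
    """Return the wall-clock seconds taken to fetch url once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def sample_pods(pod_ips, path="/health", port=8080, n=20):
    """Collect n latency samples per pod. Path and port are placeholders."""
    return {
        ip: [time_request(f"http://{ip}:{port}{path}") for _ in range(n)]
        for ip in pod_ips
    }


def find_slow_pods(samples, factor=3.0):
    """Return pods whose median latency exceeds factor times the
    median of all per-pod medians."""
    medians = {pod: statistics.median(vals) for pod, vals in samples.items()}
    baseline = statistics.median(medians.values())
    return sorted(pod for pod, m in medians.items() if m > factor * baseline)
```

Calling `find_slow_pods(sample_pods([...]))` from a machine that can reach the pod network would list the outliers. Comparing medians instead of means keeps a single slow request from flagging an otherwise healthy pod.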

I tried killing the slow pod so that the Deployment spun up a new one, and the new pod worked fine. The problem is that another pod then became slow instead. The fast:slow ratio stays at 5:1.

CPU utilization of the pods is below 30%, and they have ample resources available.

I am not able to figure out the reason. Please help.


Solution

  • I am not the asker, but oddly enough I ran into a similar issue that could not be attributed to anything obvious.

    After a lot of debugging and leaving no stone unturned, we finally disabled Prometheus Operator scraping of our pods by removing the required annotation. The "1 pod performance issue" magically disappeared.

    We port-forwarded to one of the pods with kubectl and checked our metrics endpoint: it was generating 6 MB (!) of metric data, which is quite a lot, and it took around 700-1000 ms to generate even with no load present. It turned out that a custom metric of ours had a regression and created a lot of tag variants for one specific metric, which accounted for nearly 3 MB of the generated metrics. The next problem was Kafka Streams, which generates a lot of very detailed metrics (even per Kafka Streams node operation, or tagged for every node in the connected Kafka cluster). There is no easy way for us to disable the collection of these metrics, so we just excluded them from our Prometheus endpoint.

    This left us with a meager 32 KB of metrics. Reactivating Prometheus Operator scraping did not reintroduce the observed issue.

    But why one pod? We basically have two Prometheus Operator instances scraping all registered endpoints every 30 seconds, which gives an average scrape interval of around 15 seconds per pod. We checked our HTTP metrics and then it struck us: one pod was scraped 8-10x more often than any other pod, which works out to a scrape roughly every 1.5 to 2 seconds! Under high load it is not unlikely that the Prometheus endpoint takes more than 1.5 seconds to respond, which means that another scrape starts before the previous one has completed. All of this piled up more and more CPU usage, which led to more throttling of the pod because it was hitting its CPU limit, which in turn increased the metrics endpoint's response time, which led to even more concurrent scrapes, each generating 6 MB of data.

    As to why one pod was scraped this often: we have no definite answer yet, as our systems team is still investigating. Sadly, the 8-10x scraping rate disappeared after we reduced the response size of our metrics endpoint.

    We basically got DDoSed by metrics scraping that occurred too often on one pod (reason unknown). This has to be the most complex thing I have ever debugged. We removed every part of our application (DB layer, tracing, Prometheus, custom metrics) one by one until we found the culprit. We even considered whether specific Kubernetes nodes were the culprit, and checked whether something like entropy was running out for whatever reason.

    Good luck with your issue. I hope this helps some poor soul avoid wasting more than a week searching for a needle in a haystack!
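For anyone who wants to check whether an oversized metrics endpoint is the cause, as it was in the answer above, here is a minimal sketch that ranks metric names by how many series and bytes they contribute to a scrape. It assumes the endpoint's output is in the Prometheus text exposition format and has already been saved to a string or file (for example through a port-forward). The function name `summarize_metrics` is made up for illustration and is not what the answer's author used.

```python
import re
from collections import defaultdict

# A metric name is the leading identifier on each sample line.
_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def summarize_metrics(exposition_text):
    """Return (metric_name, series_count, bytes) tuples, largest first.

    Comment lines (# HELP / # TYPE) are skipped, so the byte count
    covers sample lines only.
    """
    stats = defaultdict(lambda: [0, 0])
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _NAME.match(line)
        if not match:
            continue
        entry = stats[match.group(0)]
        entry[0] += 1                               # one more series
        entry[1] += len(line.encode("utf-8")) + 1   # line plus newline
    return sorted(
        ((name, count, size) for name, (count, size) in stats.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```

Printing the first few rows of the result should make a metric with runaway tag variants stand out immediately. Histogram and summary metrics show up under their `_bucket`, `_sum` and `_count` names, because the sketch groups by the literal name on each line.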