Tags: multithreading, asp.net-core, kubernetes, kestrel-http-server

How to troubleshoot thread starvation in ASP.NET Core on Linux (Kubernetes)?


I'm running an ASP.NET Core API on Linux, on Kubernetes in the Google Cloud.

This is an API under high load, and on every request it calls into a library that performs a long (1-5 second), CPU-intensive operation.

What I see is that after deployment the API works properly for a while, but after 10-20 minutes it becomes unresponsive, and even the health check endpoint (which just returns a hardcoded 200 OK) stops working and times out. (This makes Kubernetes kill the pods.)

Sometimes I'm also seeing the infamous Heartbeat took longer than "00:00:01" error message in the logs.

Googling these phenomena points me to "thread starvation": either too many thread pool threads have been started, or too many threads are blocked waiting on something, so no threads are left in the pool to pick up incoming ASP.NET Core requests (hence the timeout of even the health check endpoint).

What is the best way to troubleshoot this issue? I started monitoring the numbers returned by ThreadPool.GetMaxThreads and ThreadPool.GetAvailableThreads, but they stay constant (the completion port count is always 1000 for both max and available, and the worker count is always 32767).
Is there any other property I should monitor?
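
For reference, this is roughly how I'm sampling the thread pool from inside the app (a minimal sketch; the class name is mine, and ThreadPool.ThreadCount / ThreadPool.PendingWorkItemCount require .NET Core 3.0 or later). As far as I know, the same counters can also be watched from outside the process with dotnet-counters monitor using the System.Runtime provider.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Register with services.AddHostedService<ThreadPoolMonitor>() at startup.
public class ThreadPoolMonitor : BackgroundService
{
    private readonly ILogger<ThreadPoolMonitor> _logger;

    public ThreadPoolMonitor(ILogger<ThreadPoolMonitor> logger) => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            ThreadPool.GetAvailableThreads(out var workerAvailable, out var ioAvailable);

            _logger.LogInformation(
                "ThreadPool: {ThreadCount} threads, {Pending} queued work items, " +
                "{WorkerAvailable} worker / {IoAvailable} IO threads available",
                ThreadPool.ThreadCount,           // threads currently in the pool
                ThreadPool.PendingWorkItemCount,  // work items waiting for a free thread
                workerAvailable,
                ioAvailable);

            try
            {
                await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
            }
            catch (OperationCanceledException)
            {
                // host is shutting down
            }
        }
    }
}
```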


Solution

  • Are you sure your ASP.NET Core web app is running out of threads? It may simply be saturating all the pod's available resources, causing Kubernetes to kill the pod itself, and with it your web app.

    I experienced a very similar scenario with an ASP.NET Core web API running on Red Hat Linux within an OpenShift environment, which uses the same pod concept as Kubernetes: one call required approximately 1 second to complete and, under heavy load, the app first became slower and then unresponsive, causing OpenShift to kill the pod, and with it my web app.

    It may be that your ASP.NET Core web app is not running out of threads at all, especially considering the large number of worker threads available in the ThreadPool. Instead, the number of active threads combined with their CPU demand is probably too large for the millicores actually available to the pod they run in: once created, there are so many active threads for the available CPU that most of them end up queued by the scheduler, waiting for execution, while only a handful actually run. The scheduler then does its job of sharing the CPU fairly among threads by frequently context-switching between them. In your case, where threads perform heavy, long CPU-bound operations, resources saturate over time and the web app becomes unresponsive.

    A mitigation step may be giving your pods more capacity, especially millicores, or increasing the number of pods Kubernetes may deploy based on demand. However, in my particular scenario this approach did not help much. Instead, improving the API itself by reducing the execution time of one request from 1 s to 300 ms noticeably improved overall web application performance and actually solved the issue.

    For example, if your library performs the same calculations across multiple requests, you may consider caching results in your data structures to gain speed at a slight cost in memory (which worked for me), especially if your operations are mainly CPU-bound and your web app sees that kind of request volume. You may also consider enabling response caching in ASP.NET Core if that makes sense for the workload and responses of your API (see the two sketches after this answer). With caching, your web app does not perform the same task twice, freeing up CPU and reducing the risk of queued threads.

    Processing each request faster makes your web app less likely to fill up the available CPU and therefore reduces the risk of having too many threads queued and waiting for execution.
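
    To make the data-structure caching idea concrete, here is a minimal sketch, assuming the library call is deterministic for a given input. ExpensiveLibrary, Compute and CachedComputationService are placeholder names for your own code, and IMemoryCache requires services.AddMemoryCache() at startup.

    ```csharp
    using System;
    using System.Threading;
    using Microsoft.Extensions.Caching.Memory;

    // Placeholder for the real CPU-intensive library call (1-5 s per invocation).
    public static class ExpensiveLibrary
    {
        public static string Compute(string input)
        {
            Thread.Sleep(TimeSpan.FromSeconds(2)); // stand-in for the heavy work
            return $"result for {input}";
        }
    }

    public class CachedComputationService
    {
        private readonly IMemoryCache _cache;

        public CachedComputationService(IMemoryCache cache) => _cache = cache;

        public string GetResult(string input)
        {
            // Only the first request for a given input pays the full CPU cost;
            // identical requests within the expiration window are served from memory.
            return _cache.GetOrCreate($"result:{input}", entry =>
            {
                entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(10);
                return ExpensiveLibrary.Compute(input);
            });
        }
    }
    ```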
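
    And a sketch of enabling ASP.NET Core's built-in response caching middleware, assuming your endpoint is a cache-friendly GET. The controller, route, and parameter names are illustrative, and the Startup style shown here may need adapting to your hosting model.

    ```csharp
    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Extensions.DependencyInjection;

    public class Startup
    {
        public void ConfigureServices(IServiceCollection services)
        {
            services.AddResponseCaching();
            services.AddControllers();
        }

        public void Configure(IApplicationBuilder app)
        {
            app.UseRouting();
            app.UseResponseCaching();   // after UseRouting, before the endpoints
            app.UseEndpoints(endpoints => endpoints.MapControllers());
        }
    }

    [ApiController]
    [Route("api/[controller]")]
    public class ComputeController : ControllerBase
    {
        // Cache the response for 60 s, varied by the "input" query parameter,
        // so repeated identical calls skip the CPU-heavy work entirely.
        [HttpGet]
        [ResponseCache(Duration = 60, VaryByQueryKeys = new[] { "input" })]
        public ActionResult<string> Get(string input)
        {
            // Call your CPU-heavy library here; cached responses bypass it.
            return Ok($"computed for {input}");
        }
    }
    ```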