Tags: async-await, threadpool, stackexchange.redis

Under what circumstance could improper use of .NET 4.8 async/await bring down a multi-instance Azure App Service?


We have a large-scale ASP.NET MVC application built on .NET Framework 4.8. We have rules-based auto-scaling with a minimum number of instances held during peak hours. We've recently been having outages during active hours. These outages are severe enough that they bring down all instances, and only a restart (or slot swap) gets us back up and running.

When these outages happen, we're not seeing a spike in requests or CPU (though spikes do occur fairly often). That said, these outages only occur during active hours (mostly US-based users). In fact, just before these outages we start to see a drop in requests while CPU stays steady.

Things to note:

We have a lot of poor use of async/await (or lack thereof). There is a plethora of .Result and .Wait calls on async methods, some of them even inside async methods. There are areas where we aren't using .ConfigureAwait(false) where we should be. And so on.
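
Roughly, the pattern looks like this (a simplified, hypothetical sketch; the controller, URLs, and method names are placeholders, not our actual code):

    using System.Net.Http;
    using System.Threading.Tasks;
    using System.Web.Mvc;

    public class OrdersController : Controller
    {
        private static readonly HttpClient Http = new HttpClient();

        // Sync-over-async: blocks a thread pool thread on a call that still
        // needs another thread pool thread in order to complete.
        public ActionResult Details(int id)
        {
            string order = Http.GetStringAsync("https://internal.example.com/orders/" + id).Result;
            return Content(order);
        }

        // The same blocking even inside an async method.
        public async Task<ActionResult> Summary(int id)
        {
            Http.GetAsync("https://internal.example.com/orders/" + id + "/touch").Wait();   // should be: await ...
            string order = await Http.GetStringAsync("https://internal.example.com/orders/" + id);
            return Content(order);
        }
    }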

We are making a concerted effort to address these areas of technical debt.

We also use Redis as a SessionStateProvider in addition to using Redis as our caching mechanism. During the outages we see a lot of Redis timeout exceptions. We do have Redis set up properly according to all the documentation out there (i.e., raised minimum thread pool values, a singleton Lazy<ConnectionMultiplexer> implementation, etc.). These timeouts seem to be a symptom of the underlying issue and NOT the cause from what we can see. Please correct me if I'm wrong about this.
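
Our connection setup follows the commonly documented pattern, roughly like this (a simplified sketch; the connection string and thread counts are placeholders):

    using System;
    using System.Threading;
    using StackExchange.Redis;

    public static class RedisConnection
    {
        // Singleton Lazy<ConnectionMultiplexer> shared by the whole app.
        private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
            new Lazy<ConnectionMultiplexer>(() =>
                ConnectionMultiplexer.Connect("<cache>.redis.cache.windows.net:6380,password=<key>,ssl=True,abortConnect=False"));

        public static ConnectionMultiplexer Connection => LazyConnection.Value;

        // Called once from Application_Start: raise the minimum worker/IOCP
        // thread counts so bursts don't have to wait on thread injection.
        public static void ConfigureThreadPool()
        {
            ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);
        }
    }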

The things we've done to try to mitigate these outages: slowly working through the code base to fix these async/await issues, trying to keep bots from crawling us, finding our highest-throughput request paths and making them async, and lots of log review in New Relic.

The biggest question stumping us is that these incidents take down the entire App Service (not the plan, just the service). Even manually scaling out during an incident doesn't help.

Intuitively I suspect this is a thread-blocking situation caused by the async misuse, but I'm not entirely sure how that could affect more than one instance of the App Service. Any help or thoughts would be appreciated.


Solution

  • It sounds an awful lot like thread pool exhaustion. You can verify this by taking a dump, loading it in VS, and examining the Parallel Stacks window. You'll probably see a large number of thread pool threads blocked on tasks and very few, if any, available for work (i.e., waiting on the thread pool work queue).

    Thread pool exhaustion is simply the state where no more threads are free. The thread pool will inject new threads when it is saturated, but only at a limited rate, which is generally not fast enough to recover from this situation.
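
    If taking a dump is awkward in production, a rough alternative signal is to periodically log the thread pool headroom; a minimal sketch (the logging mechanism and wording are placeholders):

        using System.Diagnostics;
        using System.Threading;

        public static class ThreadPoolMonitor
        {
            // Logs thread pool usage against the maximum. Usage climbing
            // toward the max while request throughput drops is the classic
            // exhaustion signature.
            public static void LogUsage()
            {
                ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
                ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIo);

                Trace.TraceInformation(
                    "Worker threads in use: {0}/{1}; IOCP threads in use: {2}/{3}",
                    maxWorker - freeWorker, maxWorker, maxIo - freeIo, maxIo);
            }
        }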

    We have a lot of poor use of async/await (or lack thereof). There is a plethora of .Result and .Wait calls on async methods, some of them even inside async methods.

    This will cause thread pool exhaustion at scale. Of course, this kind of code is blocking threads, but it's actually worse than code that is synchronous all the way: blocking on asynchronous code is worse for scalability than plain synchronous code.

    The reason is that almost all asynchronous APIs actually depend on the thread pool to make forward progress. For example, a database query or HTTP invocation has to deserialize its response; internally it does a truly asynchronous read followed by a brief deserialization on a thread pool thread. So when the thread pool gets saturated and there are threads blocked on asynchronous code, the existing requests have difficulty completing.

    In a fully-synchronous world, the thread is already there (blocked), and when the operation completes it just continues executing. In a blocking-on-asynchronous-code world, one thread is blocked, but another thread has to be free in order to complete the operation and allow that request to continue.

    This same problem doesn't occur in a fully-asynchronous world because of an async optimization: a thread pool thread is necessary to complete the operation, but then that same thread just continues executing the rest of your handler code, until it hits another await or completes sending the response.
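
    To make the thread accounting concrete, here's a minimal sketch of a fully asynchronous handler (hypothetical controller and URL): no thread is held while the call is in flight, and the thread that completes the I/O just keeps running the method.

        using System.Net.Http;
        using System.Threading.Tasks;
        using System.Web.Mvc;

        public class PriceController : Controller
        {
            private static readonly HttpClient Http = new HttpClient();

            public async Task<ActionResult> Quote(string symbol)
            {
                // While the HTTP call is in flight, no thread is consumed here.
                // When the response arrives, a thread pool thread does the brief
                // read/deserialize work and then continues executing the rest of
                // this method; no second (blocked) thread is needed anywhere.
                string json = await Http.GetStringAsync("https://api.example.com/price/" + symbol);
                return Content(json, "application/json");
            }
        }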

    Thread pool exhaustion is usually exacerbated by poor cancellation support; code that blocks on asynchronous code usually does not support cancellation. So the HTTP requests themselves will often time out, but the web app instance won't even be able to respond to those cancellations.
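
    A minimal sketch of what flowing cancellation looks like in this kind of app, assuming MVC's binding of a CancellationToken parameter on async actions (the endpoint, timeout value, and names are placeholders):

        using System.Net.Http;
        using System.Threading;
        using System.Threading.Tasks;
        using System.Web.Mvc;

        public class SearchController : Controller
        {
            private static readonly HttpClient Http = new HttpClient();

            // The bound token is passed down to the async I/O, so a timed-out
            // request stops doing work instead of continuing to occupy the pipeline.
            [AsyncTimeout(5000)]
            public async Task<ActionResult> Index(string q, CancellationToken cancellationToken)
            {
                var response = await Http.GetAsync("https://api.example.com/search?q=" + q, cancellationToken);
                response.EnsureSuccessStatusCode();
                string json = await response.Content.ReadAsStringAsync();
                return Content(json, "application/json");
            }
        }

    Call sites that block with .Result or .Wait() generally can't flow a token like this, which is the cancellation gap described above.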

    There are areas where we aren't using .ConfigureAwait(false) where we should be.

    This one doesn't matter as much. I'd focus on removing Wait/Result. ConfigureAwait(false) on ASP.NET is just a very minor optimization.
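
    For what it's worth, where it does apply is library-level code that never touches the request context; a small hypothetical sketch:

        using System.Net.Http;
        using System.Threading.Tasks;

        public static class PricingClient
        {
            private static readonly HttpClient Http = new HttpClient();

            // In helper code that never uses HttpContext, ConfigureAwait(false)
            // skips resuming on the ASP.NET request context: a small win, nothing more.
            public static async Task<string> GetQuoteAsync(string symbol)
            {
                var response = await Http.GetAsync("https://api.example.com/quote/" + symbol).ConfigureAwait(false);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
            }
        }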

    These timeouts seem to be a symptom of the underlying issue and NOT the cause from what we can see. Please correct me if I'm wrong about this.

    Timeouts are a natural byproduct of thread pool exhaustion, so that fits. When the thread pool is exhausted, it is often unable to complete asynchronous operations within timeout periods. This affects all asynchronous operations (even in fully-asynchronous handlers). The entire instance slows to a crawl.

    The biggest question stumping us is that these incidents take down the entire App Service (not the plan, just the service). Even manually scaling out during an incident doesn't help.

    There's also a load balancer in play. As each instance becomes unresponsive (technically not dead, but so extremely slow it may as well be), it appears "full", and the load balancer will generally shift traffic to the other instances. This can cause a cascading failure: as each instance dies, the others have to take more load, quickly causing them all to fail.

    At a constant rate of requests, adding a single instance or two isn't going to help, because they'll be immediately overwhelmed; all the other instances are "full of requests" according to the load balancer, so any new instances have to handle the entire ongoing load.

    This is why I recommend scaling by 2.5x or 3x when this occurs. 1x is what you already have; 2x will just duplicate the scenario (the same number of new instances trying to handle the same load); 2.5x or 3x gives you some breathing room and a chance to recover.

    The things we've done to try to mitigate these outages: slowly working through the code base to fix these async/await issues, trying to keep bots from crawling us, finding our highest-throughput request paths and making them async, and lots of log review in New Relic.

    These all sound like the right approach: reducing unnecessary traffic (bots) while fixing the underlying issue.

    Personally, I'd just keep it scaled at 2x expected load (avoiding thread pool exhaustion in the first place), and when accounting complains, explain that this is the interest payment on the company's technical debt. I like to imagine myself as a cranky, sarcastic old man when saying this; but in reality I'm too nice to actually say it in a mean way.

    This is the general advice I give: if you're doing sync-over-async, you pay the price of it by scaling higher than you would ideally need to.

    Eventually, I hope you will be able to pay down the tech debt and then you can properly scale the way the cloud was designed. Sometimes the business decides it isn't worth it, though.