Tags: .net, azure-web-app-service, azure-app-service-plans

Intermittent 503 Service Unavailable responses in Azure App Services


I have read every article I could find on this topic but haven't seen any good solutions. Microsoft's documentation suggests scaling up or turning on the auto-heal functionality, but those feel more like workarounds than fixes.

Our solution consists of multiple .NET APIs that communicate with each other over HTTP, in addition to using an asynchronous message broker.

We were recently informed by our client that there are short periods of time during which the system stops responding.

Digging into the logs, we see that for short periods of time (15-30 minutes) the services start returning 503 responses for a percentage of requests. Looking more closely at the system and our logging, we also see that several incoming calls from our frontend clients are getting 503 responses from our App Gateway. We see the same symptoms on multiple App Service Plans.
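As a stop-gap while we investigate, since the services call each other over HTTP, a transient-fault retry policy on the internal clients can at least mask short bursts of 503s for callers. A minimal sketch, assuming IHttpClientFactory with the Microsoft.Extensions.Http.Polly package; the client name and base address below are placeholders, not our real services:

    using System;
    using Microsoft.Extensions.DependencyInjection;
    using Polly;

    class InternalHttpClients
    {
        // Called from the usual ASP.NET Core startup; "services" is the app's IServiceCollection.
        public static void Configure(IServiceCollection services)
        {
            // "orders-api" and its base address are placeholders for one of our own APIs.
            services.AddHttpClient("orders-api", client =>
                {
                    client.BaseAddress = new Uri("https://orders-api.internal.example.com");
                    client.Timeout = TimeSpan.FromSeconds(10);
                })
                // Retry on 5xx (including 503), 408 and HttpRequestException,
                // with exponential back-off: 2s, 4s, 8s.
                .AddTransientHttpErrorPolicy(policy =>
                    policy.WaitAndRetryAsync(3,
                        attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));
        }
    }

This doesn't explain the 503s, it only reduces how visible the bursts are to end users.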

We haven't seen any noticeable change in traffic when these problems occurred, nor any resource (memory, CPU) exhaustion. We are running on Isolated App Service Plans.

This doesn't happen often (maybe once every 1-2 days), but if it occurs during peak usage it can severely disrupt the user experience.

I noticed that we run only one instance per App Service Plan, with autoscale enabled. We tried raising the instance count to two for one of the App Service Plans, but the apps under that plan still returned 503s.

Any ideas what the cause could be, or what further investigation steps we should take?


Edit:

I was able to find the 503 responses logged in one of our App Services (picture below). The log points to the load balancer as the issue, and according to the sub-status code table provided by Bryan Trach, the HTTP sub-status means "Exception in rewrite provider (likely caused by SQL)". However, when I searched for solutions to this, I found nothing.

[Screenshot: 503 responses logged by the App Service]


Solution

  • So we finally figured out what was wrong. I'm posting it here in the hope that it helps someone else.

    For us, the issue was that some of the outgoing traffic was being blocked by Azure Firewall (by default we block all unknown traffic that is not listed as required by Microsoft). Once we whitelisted some of those addresses, the issue went away. What is strange is that those addresses are not listed in any of the guidelines provided by Microsoft, and we still haven't gotten a good answer from their support as to why they are actually needed. We are currently using Azure App Service Environment v2 and will probably migrate to v3, although we are not sure whether that would help. One way to check for this kind of outbound blocking from inside the app is sketched below.
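For anyone hitting the same wall: before involving support, a quick way to see whether outbound calls are being dropped is to probe the dependencies directly from the affected instance (for example from a Kudu console). A rough sketch; the endpoint list is purely illustrative and should be replaced with whatever your apps actually call out to (SQL, Key Vault, monitoring endpoints, internal APIs, and so on):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class OutboundProbe
    {
        // Hypothetical endpoints; replace with the outbound dependencies of your apps.
        static readonly string[] Endpoints =
        {
            "https://login.microsoftonline.com",
            "https://management.azure.com",
            "https://orders-api.internal.example.com/health",
        };

        static async Task Main()
        {
            using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

            foreach (var url in Endpoints)
            {
                try
                {
                    var response = await client.GetAsync(url);
                    Console.WriteLine($"{url} -> {(int)response.StatusCode}");
                }
                catch (Exception ex)
                {
                    // Timeouts or connection resets here are a hint that a firewall
                    // rule (or an NSG) is dropping the outbound traffic.
                    Console.WriteLine($"{url} -> FAILED: {ex.GetBaseException().Message}");
                }
            }
        }
    }

Endpoints that fail from the App Service subnet but work from elsewhere are good candidates for the firewall allow-list.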