reactjs, google-chrome, axios, azure-web-app-service, http-status-code-500

Why does my Azure App Service hosted React-based PWA randomly start returning a 500 error for ALL POST calls (GET works fine)? Fix: Restart browser?


Problem Background

We have a React-based (version 16.14.0) PWA that is hosted on Microsoft Azure App Service plans (Premium tier).

We've recently seen an increase in a sporadic issue where some backend endpoints return 500 errors. My initial thought was that it's a backend problem; however, I can't see how it is, and I need some new theories to try out :|

Randomly, when connecting to the app, we notice network errors: one (or more) of the three possible backends the UI talks to starts returning 500 errors. Whilst in this "state", ALL POST requests to that particular backend fail.

It ONLY affects POST endpoints. GET endpoints continue to work against these backends (which rules out DNS issues). ALL POST endpoints return a 500 (there was one exception, a POST request that succeeded, but we concluded this was because it didn't have a payload!). The OPTIONS (preflight) requests for these POST requests successfully return a 204; it's the actual request that gets a 500.
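The pattern described above (GET fine, POST with a payload failing, POST without a payload succeeding) can be checked systematically while a browser session is in the broken "state". The following is a minimal sketch, not the code used in the original investigation: `probeBackend`, `onlyBodiedPostFails`, the `FetchLike` type and the probe URL are all hypothetical names, and it assumes an endpoint that accepts all three request variants.

```typescript
// Minimal fetch-compatible types, declared locally so the sketch does not
// depend on DOM typings. The browser's own `fetch` satisfies FetchLike.
interface RequestOptions {
  method: string;
  headers?: Record<string, string>;
  body?: string;
}
type FetchLike = (url: string, init: RequestOptions) => Promise<{ status: number }>;

interface ProbeResult {
  label: string;
  // HTTP status code, or "network-error" when the request promise rejected.
  outcome: number | "network-error";
}

// Sends GET, POST without a body, and POST with a JSON body to one URL.
async function probeBackend(doFetch: FetchLike, url: string): Promise<ProbeResult[]> {
  const variants: { label: string; init: RequestOptions }[] = [
    { label: "GET", init: { method: "GET" } },
    { label: "POST without body", init: { method: "POST" } },
    {
      label: "POST with JSON body",
      init: {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ ping: true }),
      },
    },
  ];
  const results: ProbeResult[] = [];
  for (const v of variants) {
    try {
      const res = await doFetch(url, v.init);
      results.push({ label: v.label, outcome: res.status });
    } catch {
      results.push({ label: v.label, outcome: "network-error" });
    }
  }
  return results;
}

// True when only the POST that carries a payload failed (5xx or rejection).
function onlyBodiedPostFails(results: ProbeResult[]): boolean {
  const failed = (r: ProbeResult) => r.outcome === "network-error" || r.outcome >= 500;
  const bodied = results.filter((r) => r.label === "POST with JSON body");
  const others = results.filter((r) => r.label !== "POST with JSON body");
  return bodied.length > 0 && bodied.every(failed) && !others.some(failed);
}
```

In a browser this could be called as `probeBackend(fetch, url)` against each of the three backends. If `onlyBodiedPostFails` returns true for a backend, that matches the observation that the request body, rather than the HTTP method alone, is what triggers the failure.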

In our test environments, we are only hosted on one backend instance on the app service plan (which hasn't changed during the failing tests), so it's not a load balancing issue with a dodgy node in the pool.

Azure app service monitoring tools are limited... but I cannot see any activity to suggest these calls are actually ever making it to the backend. There's nothing in application insights, nothing in the failed request logs.

It affects all our backends. The UI and the C# .NET Framework 4.8.1 backend are hosted on one app service plan, and we see these failures sporadically here. The UI also fails against two other app services that are built on .NET 8 (and run on the same app service plan).

Browser error received is: Failed to load resource: net::ERR_FAILED

I have two questions... Firstly, why does it only affect POST? As I'm hoping understanding this may then help identify the root cause of the problem. Secondly, do you think it could be a front-end caused 500? I've seen evidence that some things within react (e.g. react-router) can return 500s. I always thought a 500 response would HAVE to come from the backend. But as of now, I'm not that sure, so seeking some clarification.
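On the second question, one way to narrow down what the client actually received is to look at the shape of the error axios raises. The sketch below assumes the error shape described in the axios error-handling documentation (`response` is set when an HTTP response arrived, `request` is set when the request was sent but no response arrived). `AxiosLikeError` and `classifyFailure` are illustrative names and do not come from the original post.

```typescript
// Minimal stand-in for the parts of an axios error this sketch inspects.
interface AxiosLikeError {
  message: string;
  response?: { status: number }; // present when an HTTP response was received
  request?: unknown;             // present when the request was actually sent
}

type FailureKind =
  | { kind: "http-response"; status: number }
  | { kind: "no-response" }
  | { kind: "setup-error"; message: string };

function classifyFailure(err: AxiosLikeError): FailureKind {
  if (err.response) {
    // Something that speaks HTTP answered with this status code.
    return { kind: "http-response", status: err.response.status };
  }
  if (err.request !== undefined) {
    // The request left the client code but no response came back.
    return { kind: "no-response" };
  }
  // The failure happened before a request could be sent.
  return { kind: "setup-error", message: err.message };
}
```

A result of `http-response` with status 500 only shows that some HTTP-speaking component produced the status. That component could be the application, but it could equally be a proxy, a platform front end, or a service worker, so this check narrows the search rather than settling it.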

I would raise a support ticket with Microsoft, but after raising many tickets for far simpler problems in the past, (e.g. when their services go down), I've realised this is a complete farce and waste of time which I should be using to identify the problem.

Many thanks in advance for any questions, suggestions and answers.

What we've tried:

  1. Disabling browser cache through chrome devtools
  2. Enabling Chrome's "Bypass for network" option for the service worker
  3. Restarting the backend app service - problem persists
  4. Different browsers (also affects Firefox)
  5. Older versions of chrome (through selenium tests)
  6. Fiddler - we can see the 500 request in fiddler
  7. Upgrading axios package to latest
  8. Migrating javascript backend calls from xhr to fetch
  9. Setting cache to no-store on the fetch requests
  10. Ruled out CORS issues, because we also see this error on a same-origin site (and if it were CORS, it would always be failing, not just randomly)
  11. Clearing all site data, cookies, cache and performing a hard cache reload
  12. Using the same http headers, body, cookies etc through postman - works fine
  13. Trying to identify WHEN the problem started - at least 3 months ago, but has become far more regular in the last few weeks
  14. Affecting multiple regions within Azure - we see this issue in both North Europe hosted infra, and UK South hosted infra
  15. Affecting users from different geographies (we have selenium tests that run in US West that connect to our UK South site, and they have also seen these random failures)

What we are yet to try, but is on the list:

  1. Migrating away from Windows App Service and onto a Linux (Kestrel) based host for the .NET 8 backends, in case it's an IIS issue
  2. Clear redis cache of all identity tokens
  3. Updating react and all the npm packages (massive task)

What can "fix" the error?

  1. Restarting the browser
  2. Waiting X minutes, which seems to be anywhere from 5 to 15 minutes (I'm still trying to identify how long X is)
  3. Using a different browser session (e.g. opening Chrome in incognito mode)


Solution

  • This was a problem with Azure.

    The current workaround is to change your App Service to use HTTP 1.1.

    The App Service product team are currently investigating the problem.

    The following post appears to be kept up to date by them (more so than my support ticket, which is still stuck with our CSP, CDW)

    https://learn.microsoft.com/en-us/answers/questions/1687258/our-azure-app-service-application-started-to-exper?page=1&orderby=Helpful#answers
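Since the workaround hinges on which HTTP version is in use, it can help to confirm what the browser actually negotiated with each backend. A minimal sketch, assuming the standard `nextHopProtocol` field of the browser's Resource Timing entries (for example `"h2"` or `"http/1.1"`, and an empty string when the value is not exposed for a cross-origin resource). `protocolsByOrigin` and `ResourceEntryLike` are hypothetical names.

```typescript
// Subset of PerformanceResourceTiming used here, declared locally so the
// sketch compiles without DOM typings.
interface ResourceEntryLike {
  name: string;            // absolute URL of the fetched resource
  nextHopProtocol: string; // ALPN id such as "h2" or "http/1.1"; may be ""
}

// Groups the distinct negotiated protocols by origin (scheme + host + port).
function protocolsByOrigin(entries: ResourceEntryLike[]): Record<string, string[]> {
  const byOrigin: Record<string, string[]> = {};
  for (const entry of entries) {
    const match = /^[a-z][a-z0-9+.-]*:\/\/[^/]+/i.exec(entry.name);
    if (!match) continue; // skip anything that is not an absolute URL
    const origin = match[0];
    const protocol = entry.nextHopProtocol || "unknown";
    const seen = byOrigin[origin] || [];
    if (seen.indexOf(protocol) === -1) seen.push(protocol);
    byOrigin[origin] = seen;
  }
  return byOrigin;
}
```

In a browser console the input could come from `performance.getEntriesByType("resource")`. After switching an App Service to HTTP 1.1, one would expect its origin to list `"http/1.1"` rather than `"h2"` for new requests, though this is worth verifying rather than assuming.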