Tags: c#, asp.net, azure, iis, kentico

IIS app pool crashing on Azure load-balanced VMs


We have a new ASP.NET website running on a pair of load-balanced Azure VMs. The website is fairly simple and uses Kentico CMS. Twice in the 24 hours since going live, the application pool on both web servers has suddenly stopped (within 5-10 minutes of each other), causing 503 Service Unavailable errors.

Looking at the Windows System event log, I can see the error that caused the problem:

Application pool '[[NAME]]' is being automatically disabled due to a series of failures in the process(es) serving that application pool.

Leading up to this is a series of warnings:

A process serving application pool '[[NAME]]' suffered a fatal communication error with the Windows Process Activation Service. The process id was '[[PROCESS ID]]'. The data field contains the error number.

Evidently this is IIS's rapid-fail protection kicking in. What's not clear is how to find the cause of this "fatal communication error".

After some web searching I installed the Debug Diagnostics Tool, which helped me confirm that in every case the failing process was the IIS worker process (w3wp.exe). The tool is new to me and, unfortunately, the only time the problem has occurred since I installed it, no dumps were generated. However, its logs contain a lot of messages like this:

First chance exception - 0xe0434352 caused by thread with System ID: [[ID]]

The frustrating thing is that I don't know what steps to take to replicate the error conditions. It never occurred in UAT in a very similar environment, even under load test. Here are some facts about my setup:

  • ASP.NET version = 4.5.2
  • Application pool running with identity set to a domain account with modify permission on the website directory
  • Application set with max one worker process

Any advice much appreciated.

* UPDATE 1 *

I now have a DebugDiag dump, triggered by the "fatal communication error" warning event. The dump summary reads:

Dump Summary
------------
Process Name:   w3wp.exe : C:\Windows\SysWOW64\inetsrv\w3wp.exe
Process Architecture:   x86
Exception Code: 0xC00000FD
Exception Information:  The thread used up its stack.
Heap Information:   Present
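
For readers comparing their own logs against the codes quoted above, here is a minimal sketch of a lookup helper. The `ExceptionCodes` class and its `Describe` method are hypothetical names made up for illustration and are not part of DebugDiag or any library. The table only covers a few well-known codes: 0xC00000FD is the Windows status code for a stack overflow, 0xE0434352 is the generic code the CLR (.NET 4 and later) uses when a managed exception is raised, and 0xC0000005 is an access violation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of DebugDiag.
public static class ExceptionCodes
{
    // A few well-known codes; this table is deliberately incomplete.
    private static readonly Dictionary<uint, string> Known = new Dictionary<uint, string>
    {
        { 0xC00000FD, "STATUS_STACK_OVERFLOW: the thread used up its stack" },
        { 0xE0434352, "CLR managed exception (.NET 4 and later)" },
        { 0xC0000005, "STATUS_ACCESS_VIOLATION" },
    };

    // Accepts a hex string such as "0xC00000FD", with or without the 0x prefix.
    public static string Describe(string hex)
    {
        string digits = hex.StartsWith("0x", StringComparison.OrdinalIgnoreCase)
            ? hex.Substring(2)
            : hex;
        uint code = Convert.ToUInt32(digits, 16);

        string text;
        return Known.TryGetValue(code, out text) ? text : "unknown code";
    }
}
```

Calling `ExceptionCodes.Describe("0xe0434352")` identifies the first chance exceptions in the DebugDiag log as ordinary managed exceptions, which on their own do not necessarily indicate a crash.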

Solution

  • In the end I tracked this down to a bug in my own code. In a rare edge case the CMS returned an empty Guid instead of an actual ID, which caused a stack overflow in a recursive method.

    The 0xC00000FD exception code I posted above is actually a stack overflow exception, so once I knew that and had downloaded the Debug Diagnostics dump file, I was able to replicate the crash scenario locally. That tool, by the way, is incredibly powerful and was able to demonstrate the exact conditions of the crash.

    All I can say to people who arrive here with a similar issue is: first, don't assume the issue is not in your code! And second, use Debug Diagnostics.
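
The recursive method from the answer is not shown, so the following is only a sketch of the kind of guard that would have turned the crash into an ordinary, catchable exception. The `Node` and `NodeWalker` types, the `GetDepth` method and the `MaxDepth` limit are all hypothetical and are not Kentico APIs. The guard matters because a `StackOverflowException` cannot be caught in .NET: it terminates the process, which is why w3wp.exe died and rapid-fail protection eventually disabled the pool.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a CMS item; not a Kentico type.
public sealed class Node
{
    public Guid Id { get; set; }
    public Guid? ParentId { get; set; } // null marks a root node
}

public static class NodeWalker
{
    // Assumed limit for illustration; pick one that suits the real data.
    private const int MaxDepth = 64;

    // Recursively counts how many ancestors a node has.
    public static int GetDepth(Guid id, IReadOnlyDictionary<Guid, Node> nodes, int depth = 0)
    {
        // Reject the "empty Guid instead of an actual ID" case up front.
        if (id == Guid.Empty)
            throw new ArgumentException("Expected a real ID but received Guid.Empty.", nameof(id));

        // Fail with a catchable exception long before the stack is exhausted.
        if (depth > MaxDepth)
            throw new InvalidOperationException("Recursion limit exceeded; possible cycle in the data.");

        Node node;
        if (!nodes.TryGetValue(id, out node) || node.ParentId == null)
            return depth;

        return GetDepth(node.ParentId.Value, nodes, depth + 1);
    }
}
```

With a guard like this, bad data from the CMS surfaces as an exception that can be logged with a stack trace, rather than as a worker process crash that needs a memory dump to diagnose.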