I have a managed-code Windows Service application that occasionally crashes in production due to a managed StackOverflowException. I know this because I've run ADPlus in crash mode and analyzed the crash dump post mortem using SOS. I have even attached the WinDbg debugger and set it to "go unhandled exception".
My problem is, I can't see any of the managed stacks or switch to any of the threads. They're all being torn down by the time the debugger breaks.
I'm not a WinDbg expert. Short of installing Visual Studio on the live system or debugging remotely with it, does anyone have any suggestions for how I can get a stack trace out of the offending thread?
Here's what I'm doing.
!threads
...
XXXX 11 27c 000000001b2175f0 b220 Disabled 00000000072c9058:00000000072cad80 0000000019bdd3f0 0 Ukn System.StackOverflowException (0000000000c010d0)
...
And at this point you see the XXXX ID indicating the thread is quite dead.
Once you've hit a stack overflow, you're pretty much out of luck for debugging the problem. Blowing your stack space leaves your program in an indeterminate state, so you can't rely on any of the information in it at that point: any stack trace you try to get may be corrupted and can easily point you in the wrong direction. I.e., once the StackOverflowException occurs, it's too late.
Also, according to the documentation, you can't catch a StackOverflowException from .NET 2.0 onwards; by default the runtime terminates the process instead. So the other suggestions to surround your code with a try/catch for that exception probably won't work. This makes perfect sense, given the side effects of a stack overflow (I'm surprised .NET ever allowed you to catch it).
Your only real option is to engage in the tedium of analyzing the code, looking for anything that could potentially cause a stack overflow, and putting in some sort of markers so you can get an idea of where the overflows occur before they occur. E.g., recursive methods are the obvious place to start: give each one a depth counter and throw your own exception if it reaches some "unreasonable" value that you define. Because that exception is thrown while the stack is still intact, you actually get a valid stack trace.
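A minimal sketch of the depth-counter idea in C#, under some loud assumptions: the `Node` type, the `CountNodes` method, and the `MaxDepth` value of 1000 are all hypothetical and only stand in for whatever recursive code your service actually contains. The limit just needs to be comfortably below the depth at which the real stack would overflow.

```csharp
using System;

// Hypothetical linked node type, used only for this illustration.
public sealed class Node
{
    public Node Child;
}

public static class DepthGuardDemo
{
    // Assumed limit: choose a value well below the depth where the stack really overflows.
    public const int MaxDepth = 1000;

    // Hypothetical recursive method with an explicit depth counter.
    public static int CountNodes(Node node, int depth = 0)
    {
        if (depth > MaxDepth)
        {
            // An ordinary exception: it can be caught and logged, and its
            // StackTrace still shows the recursive call chain.
            throw new InvalidOperationException(
                "Recursion depth exceeded " + MaxDepth + " in CountNodes.");
        }

        if (node == null)
        {
            return 0;
        }

        return 1 + CountNodes(node.Child, depth + 1);
    }

    public static void Main()
    {
        // Build a two-node cycle to simulate runaway recursion.
        var a = new Node();
        var b = new Node { Child = a };
        a.Child = b;

        try
        {
            CountNodes(a);
        }
        catch (InvalidOperationException ex)
        {
            // In a service you would write this to your log instead of the console.
            Console.WriteLine(ex.Message);
            Console.WriteLine(ex.StackTrace);
        }
    }
}
```

Passing the counter as a parameter keeps it per call chain, so it behaves correctly even if several threads run the same method at once. A static counter would need to be thread-static or otherwise synchronized.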