.net memory out-of-memory perfmon working-set

WorkingSet Spike just before OutOfMemoryException

I am investigating an incident where an OutOfMemoryException has been thrown in production, for a "traditional" .NET server application. My purpose is to interpret a specific portion of the data gathered through Performance Monitor and seek some advice on how to move on. Let me start with a list of facts:

The process had been running for over 20 days until the crash.
It crashed because an exception of type System.OutOfMemoryException was thrown.
There have been similar incidents in the past. Similarly, it takes a long time for the application to crash.
The process had been monitored through Performance Monitor by the following counters: # Bytes in all Heaps, % Processor Time, Private Bytes, Working Set.
We cannot capture any memory dumps at the production environment, and we haven't been able to reproduce it.

In the first screenshot, you can see the overall behavior of the counters in a span of 7 days. Things are pretty much stable. The second screenshot shows the behavior over the last minute, around the crash. The OutOfMemoryException was logged on 3:13:49PM.

7day behaviour Last minute around the crash

My questions are: 1. Any ideas what does the sudden increase of Working Set mean? It was overall stable at 650ish MB, and in 10 seconds it climbed up to 1,3GB. 2. Should I focus on finding something that triggered the OOM just before the crash, or could it be an accumulative factor? As you've seen, Private Bytes and Bytes on all Heaps are pretty much stable.

Solution

These kinds of problems are exceedingly difficult to diagnose. It is quite possible that what is happening is not the result of a single condition that triggers the behaviour, but a set of simultaneous conditions.

Here's what we know:

No cumulative problem indicated: If the problem was cumulative, we would expect to see some sign of that of the 20 day period leading up to the event. This does not mean that the preceding operation can be ignored. It is possible that some of the conditions that trigger the behaviour are staged, and start earlier on. This is something we cannot know with the information available.
Heaps are stable: The Private Bytes measure tells us how much memory has been reserved (not touched, as stephbu suggested). Bytes-in-all-Heaps tells us how much of the reserved memory is current allocated according to the memory manager (GC). Since both of these are stable, it would seem that the problem isn't necessarily a memory leak. The danger is that we only have 10 seconds of interesting data, and since GC is usually fairly passive, it isn't clear how accurate those statistics would be (particular with the wonky working set).
Working set indicates thrashing: The Working Set tells us how much physical memory the OS wants to keep paged-in to ensure reasonable performance. A growing working set indicates thrashing. A growing working set is normally associated with two things:
- increased allocation rate
- increased object longevity (often temporary)
Increased object longevity is not indicated, because the heaps are not showing growth. Increased allocation rate is possible, but the objects are still short-lived (since a leak is not indicated).

These observations suggest to me that some rare event (or set of events) is triggering a condition in which there is:

a high allocation rate
of moderately large objects
which are not very long-lived
GC is thrashing as a result

There are other reports of these conditions causing OutOfMemoryExceptions. I'm not all that certain why it happens. If you are running a 32-bit environment, then a possible reason is fragmentation of the address space. This can happen if the GC cannot obtain contiguous pages from the OS.

Another possibility (which I cannot verify) is that the GC requests the OS to not page-out parts of the heap it is working on. If the number of locked pages gets high, an out-of-memory might result. This idea is almost total speculation as I do not know enough about the internals of Microsofts GC implementation.

I don't have any better explanations right now, but I'd definitely like a better explanation if anyone can provide one.

Finally, you might want to verify that a reasonable Latency Mode is enabled. If this was the problem, I think we would have seen an escalation of Bytes-in-all-Heaps -- so it's probably ok.

Can you check what variable is indicated by the dashed line in the second chart? If it is processor use, then it is consistent with thrashing. As the need to page-in content more frequently increases, disk IO should increase, and (at a certain point) percentage processor use should decline, because everything is waiting for the disk. This is just an extra detail -- if the processor use doesn't decline excessively, thrashing is still a posibility. This is because parts of the software might still exhibit good locality and be able to make progress.