Our multithreaded application runs a lengthy computational loop. On average it takes about 29 sec to finish one full cycle. During that time, the .NET performance counter "% Time in GC" measures 8.5%, all of it made up of Gen 2 collections.
In order to improve performance, we implemented a pool for our large objects and achieved a 100% reuse rate. The overall cycle now takes only 20 sec on average, and "% Time in GC" shows something between 0.3 and 0.5%. Now the GC performs only Gen 0 collections.
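A pool along these lines could be sketched as follows (this is an assumption about the shape of the implementation, which the question does not show; `ObjectPool<T>` and the factory delegate are hypothetical names). `ConcurrentBag<T>` is available from .NET 4.0 on and fits the multithreaded scenario:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch of a thread-safe pool for large, reusable objects.
class ObjectPool<T> where T : class
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    public ObjectPool(Func<T> factory) { _factory = factory; }

    // Returns a pooled instance, or creates a new one on a pool miss.
    public T Rent()
    {
        T item;
        return _items.TryTake(out item) ? item : _factory();
    }

    // Hands the instance back so a later Rent() can reuse it.
    public void Return(T item) { _items.Add(item); }
}
```

With a 100% reuse rate, every `Rent()` after warm-up is served from the bag, so no new large objects reach the GC heap during a cycle.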
Let's assume the pooling is implemented efficiently and neglect the additional time it takes to execute. That gives a performance improvement of roughly 31 percent (9 of the original 29 seconds). How does that relate to the former GC value of 8.5%?
I have some assumptions, which I hope can be confirmed, adjusted and amended:
1) The "% Time in GC" counter (if I read it right) measures the ratio of two time spans: the time spent performing garbage collection and the total elapsed time since the counter was last updated.
What is not included in the second time span would be the overhead of stopping and restarting the worker threads for the blocking GC. But how could that be as large as 20% of the overall execution time?
2) Frequently blocking the threads for GC may introduce contention between the threads? This is just a thought; I could not confirm it with the VS concurrency profiler.
3) In contrast to that, I could confirm that the number of page faults (performance counter: Memory -> Page Faults/sec) is significantly higher for the unpooled application (25,000 per second) than for the application with the low GC rate (200 per second). I could imagine this would cause a large part of the improvement as well. But what could explain that behaviour? Is it that frequent allocations cause a much larger area of the virtual address space to be used, which is therefore harder to keep in physical memory? And how could that be measured to confirm it as the reason here?
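One way to sample these counters programmatically (assuming the standard Windows memory counters are available on the machine) is `System.Diagnostics.PerformanceCounter`. Note that "Page Faults/sec" includes soft faults, which never touch the disk; for the "harder to keep in physical memory" hypothesis, the hard-fault counter "Pages/sec" is the more telling number:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class PageFaultSampler
{
    public static void Main()
    {
        using (var faults = new PerformanceCounter("Memory", "Page Faults/sec"))
        using (var hard = new PerformanceCounter("Memory", "Pages/sec"))
        {
            // The first NextValue() of a rate counter always returns 0,
            // so prime both counters and wait one sampling interval.
            faults.NextValue();
            hard.NextValue();
            Thread.Sleep(1000);

            Console.WriteLine("Page Faults/sec (soft + hard): {0:F0}", faults.NextValue());
            Console.WriteLine("Pages/sec (hard, disk-backed): {0:F0}", hard.NextValue());
        }
    }
}
```

If "Pages/sec" stays near zero while "Page Faults/sec" is high, the faults are soft and the cost is page-table work rather than disk I/O.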
BTW: GCSettings.IsServerGC = false, .NET 4.0, 64-bit, running on Win7 with 4 GB RAM on an Intel i5. (And sorry for the large question.. ;)
Pre-allocating the objects improves concurrency: the threads no longer have to enter the global lock that protects the garbage-collected heap in order to allocate an object. The lock is held only for a very short time, but you were clearly allocating a lot of objects, so it is quite likely that threads were fighting over the lock.
The "% Time in GC" performance counter measures the percentage of CPU time spent collecting instead of executing regular code. You can get a big number if there are a lot of Gen 2 collections and the rate at which you allocate objects is so great that background collection can no longer keep up, forcing the threads to be blocked. Having more threads makes that worse, since they can allocate even faster.
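The generation claims can also be checked directly in code with `GC.CollectionCount`, independent of the performance counter. One relevant fact here: objects of 85,000 bytes or more are allocated on the large object heap, which is only collected as part of a Gen 2 collection, which would explain why allocating large objects each cycle showed up purely as Gen 2 time. The loop below is a stand-in workload, not the question's actual computation:

```csharp
using System;

class GcGenerationCheck
{
    public static void Main()
    {
        int gen0Before = GC.CollectionCount(0);
        int gen2Before = GC.CollectionCount(2);

        // Stand-in workload: many short-lived small allocations.
        // These die in Gen 0 and should trigger only Gen 0 collections.
        long total = 0;
        for (int i = 0; i < 1000000; i++)
        {
            var tmp = new byte[128];
            total += tmp.Length;
        }

        Console.WriteLine("Allocated {0} bytes in small chunks", total);
        Console.WriteLine("Gen 0 collections: {0}", GC.CollectionCount(0) - gen0Before);
        Console.WriteLine("Gen 2 collections: {0}", GC.CollectionCount(2) - gen2Before);
    }
}
```

Swapping `new byte[128]` for allocations of 85,000+ bytes should make the Gen 2 count climb instead, mirroring the unpooled behaviour described in the question.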