I recently benchmarked the .NET 4 garbage collector, allocating intensively from several threads. When the allocated values were recorded in an array, I observed no scalability, just as I had expected (because the system contends for synchronized access to a shared old generation). However, when the allocated values were immediately discarded, I was horrified to observe no scalability then either!
I had expected the discarded-values case to scale almost linearly, because each thread should simply wipe its gen0 nursery clean and start again without contending for any shared resources (nothing survives to older generations, and there are no L2 cache misses because gen0 easily fits in L1 cache).
For example, this MSDN article says:
"Synchronization-free Allocations: On a multiprocessor system, generation 0 of the managed heap is split into multiple memory arenas using one arena per thread. This allows multiple threads to make allocations simultaneously so that exclusive access to the heap is not required."
Can anyone verify my findings and/or explain this discrepancy between my predictions and observations?
Not a complete answer to the question, but just to clear up some misconceptions: the .NET GC is only concurrent in workstation mode. In server mode, it uses a stop-the-world parallel GC. More details here. The separate nurseries in .NET exist primarily to avoid synchronisation on allocation; they are nevertheless part of the global heap and cannot be collected separately.
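To make the last point concrete, here is a toy model in Python (not the CLR's actual implementation, and not a benchmark). It assumes one bump-pointer arena per thread and a single heap-wide collection that resets every arena at once. The names `ToyHeap` and `simulate`, the sizes, and the round-robin interleaving are all made up for illustration.

```python
class ToyHeap:
    """Toy model: one gen0 arena per thread, but only heap-wide collections."""

    def __init__(self, n_threads, arena_size):
        self.arena_size = arena_size
        self.used = [0] * n_threads   # bump pointer for each thread's arena
        self.collections = 0          # number of heap-wide collections
        self.threads_paused = 0       # thread pauses summed over all collections

    def allocate(self, thread_id, size):
        # Assumes size <= arena_size.
        if self.used[thread_id] + size > self.arena_size:
            # Slow path: this arena is full, so the whole heap is collected.
            self.collect()
        # Fast path: only this thread's bump pointer is touched, so no lock
        # would be needed for the allocation itself.
        self.used[thread_id] += size

    def collect(self):
        # The arenas are part of one heap: every arena is reset and every
        # thread is paused, including threads whose arena was not full.
        self.collections += 1
        self.threads_paused += len(self.used)
        self.used = [0] * len(self.used)


def simulate(n_threads, allocs_per_thread, size=16, arena_size=64):
    """Interleave allocations round-robin as a stand-in for real threads."""
    heap = ToyHeap(n_threads, arena_size)
    for _ in range(allocs_per_thread):
        for t in range(n_threads):
            heap.allocate(t, size)
    return heap.collections, heap.threads_paused


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, simulate(n, 8))
```

In this model the allocation path never touches another thread's arena, yet a full arena on any one thread triggers a collection that pauses all of them. Whether that is what limits scalability in the benchmark above is a separate question that the model does not answer.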