Tags: c#, multithreading, cpu-cache, interlocked, mesi

Even faster inexpensive thread-safe counter?


I've read this topic: C# Thread safe fast(est) counter, and have implemented that counter in my parallel code. As far as I can see it all works fine, but it has measurably increased the processing time, by 10% or so.
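
For reference, what I implemented is essentially the shared Interlocked counter from that thread; a rough sketch (the names here are illustrative, not my actual code):

    // shared across all worker threads
    private long _processed;

    void ProcessFragment(Fragment fragment)      // Fragment is a stand-in type
    {
        DoWork(fragment);                        // cheap (<1 quantum) work
        Interlocked.Increment(ref _processed);   // LOCK-prefixed increment on a shared cache line
    }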

It's been bugging me a bit, and I think the problem lies in the fact that I'm running a huge number of relatively cheap (<1 quantum) tasks on small data fragments which are well partitioned and probably make good use of cache locality, so they run close to optimally. My best guess, based on what little I know about MESI, is that the x86 LOCK prefix in Interlocked.Increment forces the cache line into an exclusive state and invalidates the copies on the other cores, so every single parallel pass takes a cache miss and a reload just for the sake of incrementing this counter. With a 100ns-ish penalty per miss and my workload, it seems to add up. (Then again, I could be wrong.)

Now, I don't see a way around it, but maybe I am missing something obvious. I was even thinking about using n counters (corresponding to the degree of parallelism) and incrementing each on a specific core, but it seems infeasible: detecting which core I am on would probably cost more than the increment itself, not to mention requiring an elaborate if/then/else structure and messing up the execution pipeline. Any ideas on how to break this beast? :)


Solution

  • Operations from multiple cores on the same cache line contend in hardware. This is true for locked and for ordinary memory accesses alike. It is a real problem: contended accesses do not scale at all when more cores are added; scaling is typically strongly negative.

    You need to use multiple cache lines with each core using its own most of the time.

    You could use a ThreadLocal<Holder> with class Holder { public int I; } for that. ThreadLocal supports enumerating all instances that have been created, so you can sum them at the end. You could also pad the holder to the cache line size; that's safer. (See the sketch below.)
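
    A minimal sketch of that approach (assuming .NET 4.5+ for trackAllValues; the 64-byte line size and the padding fields are assumptions, not measured):

        using System.Threading;

        public sealed class Holder
        {
            public int I;
            // Crude padding so two holders allocated back-to-back don't share
            // a cache line (assumes 64-byte lines; the object header plus
            // these fields pushes the object size past that).
            private long _p1, _p2, _p3, _p4, _p5, _p6, _p7;
        }

        public static class PerThreadCounter
        {
            // trackAllValues: true lets us enumerate every thread's holder.
            private static readonly ThreadLocal<Holder> Local =
                new ThreadLocal<Holder>(() => new Holder(), trackAllValues: true);

            // Plain increment on a thread-private line: no LOCK, no sharing.
            public static void Increment()
            {
                Local.Value.I++;
            }

            // Exact once the workers have finished; mid-run it is a tear-free
            // but possibly slightly stale snapshot.
            public static long Sum()
            {
                long total = 0;
                foreach (var h in Local.Values) total += h.I;
                return total;
            }
        }

    Call Increment() from the worker body and Sum() after the parallel loop completes for an exact total.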

    Note that it is not important to have one counter per core. Per-thread is good enough, because time quanta are extremely long compared to increment operations; the few contended accesses that remain are not a performance problem.

    A faster option is to use a Holder[]. Each thread draws a random array index once and then keeps accessing that holder. Array indexing is faster than thread-local access. If the number of holder instances is much bigger (say 10x) than the number of threads, there will be little contention: most writes will go to a line that is already in the local cache.

    Instead of a random index you can use a List<Holder> and add items as more threads join the processing. A sketch of the array variant follows.
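
    A minimal sketch of the array variant, reusing the Holder class from above (the 10x sizing is from the text; the index-drawing scheme is illustrative, and the [ThreadStatic] stripe assumes a single counter instance per process):

        using System;
        using System.Threading;

        public sealed class StripedCounter
        {
            private readonly Holder[] _holders;

            // Each thread draws one stripe index the first time it increments.
            [ThreadStatic] private static int _stripe;
            [ThreadStatic] private static bool _hasStripe;

            public StripedCounter(int threads)
            {
                // ~10x the thread count keeps collisions rare.
                _holders = new Holder[threads * 10];
                for (int i = 0; i < _holders.Length; i++) _holders[i] = new Holder();
            }

            public void Increment()
            {
                if (!_hasStripe)
                {
                    _stripe = new Random(Thread.CurrentThread.ManagedThreadId)
                        .Next(_holders.Length);
                    _hasStripe = true;
                }
                // Interlocked keeps the count exact if two threads happen to
                // draw the same stripe; on a line that stays in one core's
                // cache, the LOCK prefix is cheap.
                Interlocked.Increment(ref _holders[_stripe].I);
            }

            public long Sum()
            {
                long total = 0;
                foreach (var h in _holders) total += h.I;
                return total;
            }
        }

    Whether the Interlocked inside Increment() matters depends on the collision rate; with plain writes you would trade exactness for a slightly cheaper increment.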