My CPU is a Core i7 (4 physical cores / 8 logical). I am testing my implementation of a lock-free queue. The test: I create a lot of threads ("pushers" and "poppers") and they push/pop elements. I noticed that it works much faster when the CPU is loaded: when the CPU is relatively idle the queue is much slower, and when the CPU is busy it is faster.
How should I understand this? I think it is caused by the fact that the "poppers" and "pushers" have to race for the head/tail (I mean incrementing a node's counter, which is there for memory management). With fewer poppers/pushers the contention on that counter is lower. But please note that it really is lock-free (I think so :) ).
Does that mean I should put a thread to sleep in some situations, for 2 ms? Maybe 10 ms (that is a very long time for a CPU)? I am not sure.
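A minimal sketch of the kind of stress test described, assuming a hypothetical lock_free_queue<int> with push() and try_pop() (placeholder names, not the actual queue under test):

```cpp
// Hypothetical stress test: pusher threads and popper threads hammer the
// queue with no other work in between. lock_free_queue is a stand-in type.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

template <typename T>
struct lock_free_queue {            // placeholder interface, not the real one
    void push(const T&) { /* ... */ }
    bool try_pop(T&) { return false; /* ... */ }
};

int main() {
    lock_free_queue<int> q;
    std::atomic<bool> stop{false};
    std::atomic<long> pushes{0}, pops{0};
    std::vector<std::thread> threads;

    for (int i = 0; i < 4; ++i)     // "pushers"
        threads.emplace_back([&] {
            while (!stop.load(std::memory_order_relaxed)) {
                q.push(42);
                pushes.fetch_add(1, std::memory_order_relaxed);
            }
        });
    for (int i = 0; i < 4; ++i)     // "poppers"
        threads.emplace_back([&] {
            int v;
            while (!stop.load(std::memory_order_relaxed))
                if (q.try_pop(v))
                    pops.fetch_add(1, std::memory_order_relaxed);
        });

    std::this_thread::sleep_for(std::chrono::seconds(2));
    stop = true;
    for (auto& t : threads) t.join();
    std::printf("pushes=%ld pops=%ld\n", pushes.load(), pops.load());
}
```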
Bouncing cache lines between cores is expensive. It sounds reasonable that one core could push/pop more than 4x faster than 4 cores contending with each other while they try to modify the same memory.
So it sounds like the real problem is deciding what changes in total wall-clock time or total CPU time across all the threads actually tell you about whether your code is good or not.
To put it another way: You're testing the maximum-contention case, where your threads are spending all their time pushing and popping, and not doing any other work. In real code that used this queue, the other work done by a thread would throttle the access rate to your queue, probably a lot, so threads would step on each other's toes a lot less. (Contention probably causes a significant performance hit with cmpxchg loops, because only one CPU will succeed each time, and all the rest will retry every time.)
Related: Locks Aren't Slow; Lock Contention Is (Jeff Preshing) makes the same points for testing parallel algorithms that use locks in high vs. low contention cases.
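One way to see the difference between this maximum-contention benchmark and a more realistic workload (a hypothetical tweak, not something from the question) is to give each thread a tunable amount of private work between queue operations and watch how per-operation throughput changes as you increase it:

```cpp
// Hypothetical pusher loop with a tunable amount of private work between
// pushes, to mimic a real workload that only touches the queue occasionally.
// Larger work_per_push => less contention on the queue's shared cache lines.
#include <atomic>
#include <cstdint>

volatile std::uint64_t sink;        // keeps the compiler from dropping the work

template <typename Queue>
void pusher(Queue& q, std::atomic<bool>& stop, int work_per_push) {
    std::uint64_t x = 1;
    while (!stop.load(std::memory_order_relaxed)) {
        q.push(42);                                  // the contended part
        for (int i = 0; i < work_per_push; ++i)      // private, uncontended work
            x = x * 6364136223846793005ULL + 1;
        sink = x;
    }
}
```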
Maybe try benchmarking with some threads doing read-only access
Lock-free algorithms really shine when a lot of the accesses are reads. I guess a queue is normally popped, not just read, so maybe this doesn't make sense for real use. But I bet you'd see different results if some of your threads were just reading the shared queue instead of updating it. (e.g. traverse it from head to tail as a linked list).
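A sketch of what such a read-only thread could look like; the node layout (next pointer plus value) is assumed here, and it also assumes nodes aren't reclaimed while readers are traversing (e.g. hazard pointers, epochs, or a quiescent test phase):

```cpp
// Hypothetical read-only observer: walk the queue from head to tail as a
// linked list without modifying it.
#include <atomic>
#include <cstddef>

struct Node {
    std::atomic<Node*> next{nullptr};
    int value{0};
};

std::size_t count_nodes(const std::atomic<Node*>& head) {
    std::size_t n = 0;
    // acquire loads so we only see fully-constructed nodes published by pushers
    for (Node* p = head.load(std::memory_order_acquire); p != nullptr;
         p = p->next.load(std::memory_order_acquire))
        ++n;
    return n;
}
```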
Another interesting thing to try, in the write code: add a pause instruction (_mm_pause()) before a load from shared memory somewhere in your benchmark, to avoid memory-ordering mis-speculation (i.e. where the CPU speculatively uses a value loaded before the load is allowed to become globally visible, and then has to roll back when it turns out the value was changed by another core by the time the load was supposed to become globally visible). Keep in mind that pause sleeps for ~5 cycles on Haswell, but ~100 cycles on Skylake, so even if you see a speedup from it in a non-synthetic benchmark on Haswell, it's probably a bad idea to leave it in for real use on future CPUs.
Note that pause isn't useful before locked read-modify-write instructions; they're already expecting writes from other cores. Normally you do a relaxed load and then a cmpxchg, so I'm suggesting putting the pause before the load.
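A sketch of that placement, assuming a CAS-based tail update (the real push logic of the queue, linking nodes and so on, is not reproduced here):

```cpp
// Sketch of the suggested placement: _mm_pause() before the relaxed reload
// that feeds the CAS, not before the CAS itself.
#include <atomic>
#include <immintrin.h>   // _mm_pause

struct Node;

void swing_tail(std::atomic<Node*>& tail, Node* new_node) {
    Node* expected = tail.load(std::memory_order_relaxed);
    while (!tail.compare_exchange_weak(expected, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        _mm_pause();                                      // back off briefly
        expected = tail.load(std::memory_order_relaxed);  // then re-read shared memory
        // (compare_exchange_weak already updated `expected` on failure; the
        //  explicit reload just shows where the pause sits relative to the load)
    }
}
```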