Spinlock implementation reasoning

I want to improve the performance of a program by replacing some of the mutexes with spinlocks. I have found a spinlock implementation in

http://www.boost.org/doc/libs/1_36_0/boost/detail/spinlock_sync.hpp

which I intend to reuse. I believe this implementation is safer than simpler implementations in which threads keep retrying forever, like the one found here

http://www.boost.org/doc/libs/1_54_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.example_spinlock.implementation

But I need to clarify some things about the yield function found here

http://www.boost.org/doc/libs/1_36_0/boost/detail/yield_k.hpp

First of all, I assume that the numbers 4, 16 and 32 are arbitrary. I tested some other values and got the best performance in my case with different ones.

But can someone explain the reasoning behind the yield code? Specifically, why do we need all three of:

BOOST_SMT_PAUSE
sched_yield and
nanosleep

Solution

Yes, this concept is known as an "adaptive spinlock"; see e.g. https://lwn.net/Articles/271817/.

Usually the numbers are chosen for exponential back-off: https://geidav.wordpress.com/tag/exponential-back-off/

So, the numbers aren't arbitrary. However, which numbers work for your case depends on your application patterns, requirements and system resources.
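To make the exponential back-off idea concrete, here is a minimal sketch of a spinlock that doubles its wait between acquisition attempts. The names BackoffSpinlock, next_delay, kMinDelay and kMaxDelay are hypothetical, and the limits 1 and 1024 are assumptions chosen only for illustration; this is not the Boost implementation.

```cpp
#include <atomic>
#include <cstdint>

// Assumed limits for illustration only; tune them for your workload.
constexpr std::uint32_t kMinDelay = 1;
constexpr std::uint32_t kMaxDelay = 1024;

// Double the delay, saturating at kMaxDelay.
constexpr std::uint32_t next_delay(std::uint32_t d) {
    return d < kMaxDelay / 2 ? d * 2 : kMaxDelay;
}

// Hypothetical spinlock with exponential back-off between attempts.
class BackoffSpinlock {
public:
    void lock() {
        std::uint32_t delay = kMinDelay;
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Poll with plain loads (no writes to the cache line) for up to
            // `delay` iterations, or until the lock looks free.
            for (std::uint32_t i = 0;
                 i < delay && locked_.load(std::memory_order_relaxed); ++i) {
            }
            delay = next_delay(delay);
        }
    }

    // Single attempt; returns true if the lock was acquired.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};
```

Each failed attempt waits longer than the previous one, so heavily contended locks generate fewer atomic read-modify-write operations, while lightly contended ones are still acquired almost immediately.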

The three methods of introducing "micro-delays" are designed explicitly to balance cost against potential gain:

busy-spinning adds no delay at all, but it keeps the CPU fully loaded, which means high power consumption and wasted cycles
a small "cheap" delay may avoid the cost of a context switch while reducing the CPU load relative to a busy-spin
a simple yield may let the OS avoid a context switch, depending on other system load (e.g. if the number of threads < the number of logical cores)

The trade-offs between these are important for low-latency applications, where the effect of a context switch or of cache misses is significant.
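As a rough illustration of how these tiers can be combined, here is a simplified sketch in portable C++. The thresholds 4, 16 and 32 are the ones from the question; WaitStep, classify, backoff and CPU_RELAX are hypothetical names, and std::this_thread::yield and sleep_for stand in for sched_yield and nanosleep. It is a sketch under those assumptions, not a copy of yield_k.hpp.

```cpp
#include <chrono>
#include <thread>

// CPU_RELAX is a hypothetical macro: a pause hint on x86, a no-op elsewhere.
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)
#endif

enum class WaitStep { Spin, Pause, Yield, Sleep };

// Pick a waiting strategy from the number of failed attempts so far.
// Thresholds are taken from the question and should be tuned per workload.
inline WaitStep classify(unsigned k) {
    if (k < 4)  return WaitStep::Spin;   // retry immediately
    if (k < 16) return WaitStep::Pause;  // cheap CPU-level delay
    if (k < 32) return WaitStep::Yield;  // give up the time slice
    return WaitStep::Sleep;              // stop burning CPU
}

// Wait once, escalating the cost of the wait as k grows.
inline void backoff(unsigned k) {
    switch (classify(k)) {
        case WaitStep::Spin:
            break;
        case WaitStep::Pause:
            CPU_RELAX();
            break;
        case WaitStep::Yield:
            std::this_thread::yield();
            break;
        case WaitStep::Sleep:
            std::this_thread::sleep_for(std::chrono::microseconds(1));
            break;
    }
}
```

A lock loop would call backoff(k) with k incremented after every failed acquisition attempt, so short waits stay cheap and long waits stop consuming a core.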

TL;DR

All trade-offs try to find a balance between wasting CPU cycles and losing cache/thread efficiency.