c++multithreading gcc memory-barriers spinlock

Is memory barrier or atomic operation required in a busy-wait loop?

Consider the following spin_lock() implementation, originally from this answer:

void spin_lock(volatile bool* lock)  {  
    for (;;) {
        // inserts an acquire memory barrier and a compiler barrier
        if (!__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
            return;

        while (*lock)  // no barriers; is it OK?
            cpu_relax();
    }
}

What I already know:

volatile prevents compiler from optimizing out *lock re-read on each iteration of the while loop;
volatile inserts neither memory nor compiler barriers;
such an implementation actually works in GCC for x86 (e.g. in Linux kernel) and some other architectures;
at least one memory and compiler barrier is required in spin_lock() implementation for a generic architecture; this example inserts them in __atomic_test_and_set().

Questions:

Is volatile enough here or are there any architectures or compilers where memory or compiler barrier or atomic operation is required in the while loop?

1.1 According to C++ standards?

1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?
Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
Is the while loop safe according to C++11 and its memory model?

There are several related questions, but I was unable to construct an explicit and unambiguous answer from them:

Q: Memory barrier in a single thread

In principle: Yes, if program execution moves from one core to the next, it might not see all writes that occurred on the previous core.
Q: memory barrier and cache flush

On pretty much all modern architectures, caches (like the L1 and L2 caches) are ensured coherent by hardware. There is no need to flush any cache to make memory visible to other CPUs.
Q: Is my spin lock implementation correct and optimal?
Q: Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?
Q: Do you expect that future CPU generations are not cache coherent?

Solution

Is volatile enough here or are there any architectures or compilers where memory or compiler barrier or atomic operation is required in the while loop?

will the volatile code see the change. Yes, but not necessarily as quickly as if there was a memory barrier. At some point, some form of synchronization will occur, and the new state will be read from the variable, but there are no guarantees on how much has happened elsewhere in the code.

1.1 According to C++ standards?

From cppreference : memory_order

It is the memory model and memory order which defines the generalized hardware that the code needs to work on. For a message to pass between threads of execution, an inter-thread-happens-before relationship needs to occur. This requires either...

A synchronizes-with B
A has a std::atomic operation before B
A indirectly synchronizes with B (through X).
A is sequenced before X which inter-thread happens before B
A interthread happens before X and X interthread happens before B.

As you are not performing any of those cases there will be forms of your program where on some current hardware, it may fail.

In practice, the end of a time-slice will cause the memory to become coherent, or any form of barrier on the non-spinlock thread will ensure that the caches are flushed.

Not sure on the causes of the volatile read getting the "current value".

1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?

As the code is not consistent with the generalized CPU, from C++11 then it is likely this code will fail to perform with versions of C++ which try to adhere to the standard.

From cppreference : const volatile qualifiers Volatile access stops optimizations from moving work from before it to after it, and from after it to before it.

"This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution"

So an implementation has to ensure that instructions are read from the memory location rather than any local copy. But it does not have to ensure that the volatile write is flushed through the caches to produce a coherent view across all the CPUs. In this sense, there is no time boundary on how long after a write into a volatile variable will become visible to another thread.

Also see kernel.org why volatile is nearly always wrong in kernel

Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)

There is no guarantee the volatile message gets out of the thread which sets it. So not really safe. On linux it may be safe.

Is the while loop safe according to C++11 and its memory model?

No - as it doesn't create any of the inter-thread messaging primitives.