Why is `memory_order_relaxed` enough in a TTAS spinlock on Arm64?

Consider the following spinlock implementation (the first Google result for "c++ spinlock implementation"):

#include <atomic>

struct spinlock {
  std::atomic<bool> lock_ = {false};

  void lock() noexcept {
    for (;;) {
      // Test-and-set: try to take the lock; acquire orders the critical section after it.
      if (!lock_.exchange(true, std::memory_order_acquire)) {
        return;
      }
      // Test: spin on a plain load until the lock looks free, then retry the exchange.
      while (lock_.load(std::memory_order_relaxed)) { // what's the guarantee????
        asm volatile ("yield\nyield\nyield"); // Arm spin-wait hint
      }
    }
  }

  bool try_lock() noexcept {
    return !lock_.load(std::memory_order_relaxed) &&  // why not acquire???
           !lock_.exchange(true, std::memory_order_acquire);
  }

  void unlock() noexcept {
    lock_.store(false, std::memory_order_release);
  }
};

However, it looks incorrect to me: on Arm64, memory_order_relaxed loads are not guaranteed to flush the invalidation queue (unlike on x86 with its TSO model). Is this a bug, or am I wrong?

Solution

It's a common misconception that relaxed memory ordering is insufficient to ensure that atomic writes become visible at all, or that atomic reads eventually observe them. It's true that non-atomic writes and reads have this problem, because the compiler may optimize them away altogether. But for atomic operations, we have [intro.progress p18] (using C++20):

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

(People sometimes argue about this being "should" instead of "must", but the fact is that any implementation that didn't adhere to this would be unusable. My feeling is that they put "should" just because the statement of the rule is less formally rigorous than the rest of the memory model, and they didn't want anyone to press them for a precise mathematical formulation. And if you're not satisfied, then changing it to acquire or seq_cst won't make it any better, because this is the only finite-time visibility guarantee in the standard and it applies equally to all memory orderings.)

So when another thread releases the lock, eventually the relaxed load in your loop must observe it. Thus there is no problem with correctness. And a typical machine will ensure that this happens, not merely in a finite period of time, but actually without any unnecessary delay.
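As a minimal sketch of that claim, the function below has one thread spin on a relaxed load while another performs a relaxed store. The name `relaxed_flag_becomes_visible` is hypothetical, and `std::this_thread::yield()` is used as a portable stand-in for the Arm `yield` hint in the question. It relies only on the finite-time visibility recommendation quoted above, not on any ordering between the flag and other data.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not from the original post): returns true once the
// waiting thread has seen the value stored by the writer thread.
bool relaxed_flag_becomes_visible() {
    std::atomic<bool> flag{false};
    bool observed = false;

    std::thread waiter([&] {
        // A relaxed load is enough to eventually notice the store.
        // It gives no ordering guarantees for other memory locations.
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        observed = true;
    });

    std::thread writer([&] {
        // No barrier or stronger ordering is needed for this store to
        // become visible to the waiter.
        flag.store(true, std::memory_order_relaxed);
    });

    writer.join();
    waiter.join();  // join() synchronizes with the waiter, so reading `observed` is safe
    return observed;
}
```

Calling `relaxed_flag_becomes_visible()` from `main` should return promptly on any ordinary implementation. If the relaxed load could never observe the store, the call would hang instead.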

This is also promised at the machine level by the ARM architecture: if a store is made to some memory location, then loads from that location eventually observe the value stored, even without the use of additional memory barriers. This applies at least to observers in the same Shareability Domain, but two threads of the same process will always run on cores in the same Inner Shareable domain, and memory shared between them will always be Normal Inner Shareable.

A precise statement of this property is somewhat hard to find in the ARMv8 spec, but B2.7.1 does say "A write to a memory location with the Normal attribute completes in finite time", and further down, "Each Inner Shareability domain contains a set of observers that are data coherent for each member of that set for data accesses with the Inner Shareable attribute made by any member of that set." Unfortunately they do not seem to give a precise definition of "data coherent", but it surely must include "loads eventually observe stores".

With regard to the paper cited in a recent comment thread that you participated in: even if a relaxed load doesn't cause an immediate flush of the invalidation queue, that doesn't mean that an invalidate message will just sit in that queue indefinitely. The core is always going to process those messages as quickly as it can, so eventually (normally quite soon), the cache line will be invalidated anyway. At that point, subsequent loads will have to request the line from the core where it was modified, and then the new value will be observed.

Similarly, on the writer's side, even if a store does get put in a store buffer instead of being committed immediately, it isn't going to just sit there; the core processes those stores as promptly as it can, so it will become globally visible in the near future without needing any barriers or other further instructions.
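To tie this back to the lock in the question, here is a rough sketch that exercises the same TTAS structure from several threads. The relaxed load only decides when to retry. The acquire exchange and the release store are what order the critical sections. The helper `count_with_spinlock` is hypothetical, and `std::this_thread::yield()` is an assumed portable replacement for the inline `yield` instructions.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Same structure as the lock in the question, with a portable spin-wait.
struct spinlock {
    std::atomic<bool> lock_{false};

    void lock() noexcept {
        for (;;) {
            // The acquire exchange is the operation that actually takes the lock.
            if (!lock_.exchange(true, std::memory_order_acquire)) {
                return;
            }
            // The relaxed load only waits until the lock looks free.
            while (lock_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();  // assumption: stand-in for the Arm `yield` hint
            }
        }
    }

    void unlock() noexcept {
        lock_.store(false, std::memory_order_release);
    }
};

// Hypothetical test helper: increments a plain (non-atomic) counter under the lock.
long count_with_spinlock(int num_threads, int iterations) {
    spinlock mtx;
    long counter = 0;  // deliberately non-atomic; protected only by the lock
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                mtx.lock();
                ++counter;
                mtx.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

With mutual exclusion working, `count_with_spinlock(4, 10000)` returns 40000. A passing run does not prove the lock correct, but a lost update or a hang would indicate a problem.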