Tags: c++, atomic, stdatomic

Spinning on an atomic load with acquire consistency vs. relaxed consistency


Consider the code below:

// Class member initialization:
std::atomic<bool> ready_ = false;

...

// Core A:
while (!ready_.load(std::memory_order_acquire)) {
  // On x86, you would probably put a `pause` instruction here.
}
// Core A now accesses memory written by Core B.

...

// Core B:
// Core B writes memory.
ready_.store(true, std::memory_order_release);

Assume that Core A and Core B are two different physical cores (i.e., they are not two hyperthreads co-located on the same physical core). Does Core A's code above perform worse than the code below, or the same? Note that Core A is only doing a load; this is not the classic compare-exchange example that involves a write. I am interested in the answer for several architectures.

// Core A:
while (!ready_.load(std::memory_order_relaxed)) {
  // On x86, you would probably put a `pause` instruction here.
}
std::atomic_thread_fence(std::memory_order_acquire);
// Core A now accesses memory written by Core B.

The mailbox code on this reference page suggests that the bottom version performs better because it avoids "unnecessary synchronization." However, the mailbox code iterates over many atomics, so the ordering overhead of acquire loads matters there: relaxed loads let you skip the ordering constraints on mailboxes that are not yours. It is not clear to me what the performance impact is of spinning on a single acquire load.
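For context, here is a rough sketch of the mailbox pattern I am referring to (the names kNumMailboxes, mailbox_flag, mailbox_data, and use() are mine, not from the reference page): every mailbox is scanned with a relaxed load, and only a mailbox addressed to you pays for acquire ordering, via the fence.

#include <atomic>
#include <string>

constexpr int kNumMailboxes = 32;
std::atomic<int> mailbox_flag[kNumMailboxes];  // holds the receiver's id, 0 = empty
std::string mailbox_data[kNumMailboxes];

inline void use(const std::string&) {}  // stand-in for the real work

// Receiver: scan every mailbox with relaxed loads; only a mailbox actually
// addressed to us needs acquire ordering, which the fence supplies once.
void poll(int my_id) {
  for (int i = 0; i < kNumMailboxes; ++i) {
    if (mailbox_flag[i].load(std::memory_order_relaxed) == my_id) {
      std::atomic_thread_fence(std::memory_order_acquire);
      // Guaranteed to see everything the sender wrote before its
      // release store to mailbox_flag[i].
      use(mailbox_data[i]);
    }
  }
}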


Solution

  • There are two ways in which these two versions could differ in efficiency, at least on some hypothetical architecture; the first issue works against the acquire-load loop, while the second works against the relaxed loop plus fence. On x86, my guess would be that they compile to the same code.

    The first issue is that an acquire load might affect the performance of other processors. On Alpha, which is often a good "outlier" case in studying memory consistency, you'd be issuing a memory barrier instruction over and over again, which could potentially lock the memory bus (on a non-NUMA machine), or do something else to force write atomicity of stores by two other CPUs.
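    To make the cost difference concrete, here is a minimal sketch of the two loops; the comments about per-iteration barriers describe the Alpha-style behavior claimed above and are an assumption about code generation, not something measured:

    #include <atomic>

    std::atomic<bool> ready_{false};

    // Acquire load inside the loop: on an architecture that implements
    // acquire with an explicit barrier (the Alpha-style case above), that
    // barrier is re-issued on every iteration.
    void spin_acquire() {
      while (!ready_.load(std::memory_order_acquire)) {
      }
    }

    // Relaxed loop plus one acquire fence: plain loads while spinning, and
    // the barrier is paid exactly once, after the flag is observed set.
    void spin_relaxed_then_fence() {
      while (!ready_.load(std::memory_order_relaxed)) {
      }
      std::atomic_thread_fence(std::memory_order_acquire);
    }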

    The second issue, which cuts the other way, is that the fence orders all previous loads, not just the load of ready_. So maybe on a NUMA machine, ready_ actually hits in the cache because there is no contention and your CPU already has it cached in exclusive mode, but some earlier load is still waiting on the memory system. Now you have to stall the CPU to wait for that earlier load instead of continuing to execute instructions that don't conflict with the stalled load. Here's an example:

    int a = x.load(std::memory_order_relaxed);            // unrelated earlier load
    while (!ready_.load(std::memory_order_relaxed))
      ;
    std::atomic_thread_fence(std::memory_order_acquire);  // orders x *and* ready_ before y
    int b = y;                                            // must wait for the fence
    

    In this case the load of y could potentially stall waiting for x, whereas if the load of ready_ had been done with acquire semantics, the load of x could simply continue in parallel until its value is actually needed, and the load of y would not have to wait for it.
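    For contrast, here is a sketch of the same sequence with the acquire on the ready_ load itself; the declarations are added here for completeness and mirror the snippet above:

    #include <atomic>

    std::atomic<int> x{0};
    std::atomic<bool> ready_{false};
    int y;  // plain data published by the writer before its release store

    int consume() {
      int a = x.load(std::memory_order_relaxed);       // free to complete whenever
      while (!ready_.load(std::memory_order_acquire))  // orders only ready_ before y
        ;
      int b = y;  // does not have to wait for the load of x
      return a + b;
    }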

    For the second reason, you might actually want to structure your spinlock differently. Here is how Erik Rigtorp suggests implementing a spinlock on x86, which you could easily adapt to your use case:

      void lock() {
        for (;;) {
          if (!lock_.exchange(true, std::memory_order_acquire)) {
            break;
          }
          while (lock_.load(std::memory_order_relaxed)) {
            __builtin_ia32_pause();
          }
        }
      }
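    For completeness, here is a self-contained sketch of that pattern; the struct wrapper, the lock_ member declaration, and the unlock() with a release store are filled in here following the usual convention and are not quoted from above:

      #include <atomic>

      struct Spinlock {
        std::atomic<bool> lock_{false};

        void lock() {
          for (;;) {
            // Try to grab the lock; the exchange provides acquire ordering on success.
            if (!lock_.exchange(true, std::memory_order_acquire)) {
              break;
            }
            // Spin read-only with relaxed loads on the locally cached line.
            while (lock_.load(std::memory_order_relaxed)) {
              __builtin_ia32_pause();
            }
          }
        }

        void unlock() {
          lock_.store(false, std::memory_order_release);
        }
      };

    Adapted to the ready_ flag in the question, the same idea applies: spin with relaxed loads (plus pause), and take the acquire ordering only once, after the flag is observed set.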