Tags: arm, cpu-architecture, micro-optimization, memory-barriers

Should a standalone memory fence be combined with a mutex acquire-exchange loop (or a queue acquire-load loop), or should that be avoided?


Assume a repeated acquire operation, that tries to load or exchange a value until the observed value is the desired value.

Let's take the cppreference std::atomic_flag example as a starting point:

std::atomic_flag lock = ATOMIC_FLAG_INIT;   // shared flag, as in the cppreference example

void f(int n)
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_acquire))  // acquire lock
             ; // spin
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);               // release lock
    }
}

Now let's consider enhancements to this spinning. Two well-known ones (sketched together below) are:

  • Don't spin forever; fall back to an OS wait at some point;
  • Use an instruction such as pause or yield instead of no-operation spinning.
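
A minimal sketch combining both (assuming C++20's atomic_flag::wait/notify_one for the OS wait and _mm_pause() from <immintrin.h> as the spin hint; lock is the flag from the example above):

void lock_with_backoff()
{
    for (int spins = 0; lock.test_and_set(std::memory_order_acquire); ++spins) {
        if (spins < 64)
            _mm_pause();                                 // x86 spin hint (<immintrin.h>)
        else
            lock.wait(true, std::memory_order_relaxed);  // C++20: block in the OS until notified
    }
}

void unlock_and_notify()
{
    lock.clear(std::memory_order_release);
    lock.notify_one();                                   // wake one blocked waiter
}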

I can think of a third, and I'm wondering if it ever makes sense. We can use std::atomic_thread_fence for the acquire semantics:

void f(int n)
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_relaxed))  // acquire lock
             ; // spin
        std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);               // release lock
    }
}

I expect this to make no difference on x86.

I'm wondering:

  • Are there benefits or drawbacks from this change on platforms where it does make a difference (such as ARM)?
  • Does it interact with the decision whether or not to use a yield instruction?

I'm not only interested in the atomic_flag::clear / atomic_flag::test_and_set pair; I'm also interested in the atomic<uint32_t>::store / atomic<uint32_t>::load pair.
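
For the load case, a minimal sketch of the pattern I have in mind (ready, consumer and producer are made-up names; a single producer is assumed to publish data with a release store):

std::atomic<uint32_t> ready{0};

void consumer()
{
    while (ready.load(std::memory_order_relaxed) != 1)
        ;                                                 // spin with relaxed loads
    std::atomic_thread_fence(std::memory_order_acquire);  // order later reads after the observed store
    // ... now safe to read data written before the producer's release store ...
}

void producer()
{
    // ... write the data to be published ...
    ready.store(1, std::memory_order_release);            // publish
}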


Possibly changing to a relaxed load for the inner spin could also make sense:

void f(int n)
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_acquire))  // acquire lock
            while (lock.test(std::memory_order_relaxed))       // test() is C++20
                YieldProcessor(); // spin hint (Windows macro)
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);               // release lock
    }
}

Solution

  • Yes, the general idea of avoiding an acquire barrier inside the failure retry path is possibly useful, although performance in the failure case is barely relevant if you're just spinning. pause or yield save power; on x86, pause also improves SMT friendliness and avoids memory-order mis-speculation when leaving the loop after another core modifies the memory location you're spinning on.

    But that's why CAS has separate memory_order parameters for success and failure. A relaxed failure order can let the compiler put the barrier only on the leave-the-loop path, as in the sketch below.
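
    A hedged sketch of that in source form (not from the question; lk is a placeholder lock word):

    std::atomic<int> lk{0};                    // 0 = free, 1 = held

    void cas_lock()
    {
        int expected = 0;
        while (!lk.compare_exchange_weak(expected, 1,
                                         std::memory_order_acquire,    // success: takes the lock
                                         std::memory_order_relaxed))   // failure: keep spinning, no barrier
            expected = 0;                      // CAS wrote the observed value into expected
    }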

    atomic_flag test_and_set doesn't have that option, though. Doing it manually potentially hurts ISAs like AArch64 that could have done an acquire RMW and avoided an explicit fence instruction (e.g. with ldaxrb, as in the first listing below).

    Godbolt: Original loop with lock.test_and_set(std::memory_order_acquire):

    # AArch64 gcc8.2 -O3
    .L6:                            # do{
        ldaxrb  w0, [x19]           # acquire load-exclusive
        stxrb   w1, w20, [x19]      # relaxed store-exclusive
        cbnz    w1, .L6            # LL/SC failure retry
        tst     w0, 255
        bne     .L6             # }while(old value was != 0)
      ... no barrier after this
    

    (And yes, it looks like a missed optimization that it's only testing the low 8 bits with tst instead of just cbnz w0, .L6, since ldaxrb already zero-extends the loaded byte into w0.)

    while(relaxed RMW) + std::atomic_thread_fence(std::memory_order_acquire);

    .L14:                          # do {
        ldxrb   w0, [x19]             # relaxed load-exclusive
        stxrb   w1, w20, [x19]        # relaxed store-exclusive
        cbnz    w1, .L14             # LL/SC retry
        tst     w0, 255
        bne     .L14               # }while(old value was != 0)
        dmb     ishld         #### Acquire fence
       ...
    

    It's even worse for 32-bit ARMv8, where dmb ishld isn't available or compilers don't use it: you get a dmb ish full barrier instead.


    Or with -march=armv8.1-a, which gets a single-instruction atomic swap; the acquire version:

    .L2:
        swpab   w20, w0, [x19]      # swap byte with acquire: single-instruction acquire RMW
        tst     w0, 255
        bne     .L2
        mov     x2, 19
      ...
    

    vs. the relaxed RMW plus separate acquire fence:

    .L9:
        swpb    w20, w0, [x19]      # relaxed swap byte
        tst     w0, 255
        bne     .L9
        dmb     ishld                   # acquire barrier (load ordering)
        mov     x2, 19
    ...