Assume a repeated acquire operation that loads or exchanges a value until the observed value is the desired one.
Let's take the cppreference std::atomic_flag example as a starting point:
#include <atomic>
#include <iostream>

std::atomic_flag lock = ATOMIC_FLAG_INIT;

void f(int n)
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_acquire)) // acquire lock
            ; // spin
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);                // release lock
    }
}
Now let's consider enhancements to this spinning. Two well-known ones are pause or yield instead of no-operation spinning (a minimal sketch of both follows below). I can think of a third, and I'm wondering if it ever makes sense.
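For concreteness, a minimal sketch of those two variants, assuming x86 for _mm_pause() from <immintrin.h> (YieldProcessor() on Windows is a similar hint) and reusing the lock flag from the example above; the function name f_pause is purely illustrative:
#include <atomic>
#include <iostream>
#include <thread>
#include <immintrin.h>                // _mm_pause, x86 only

extern std::atomic_flag lock;         // same flag as in the example above

void f_pause(int n)                   // illustrative name, not from the original example
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_acquire)) // acquire lock
            _mm_pause();              // or: std::this_thread::yield(); // spin politely
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);                // release lock
    }
}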
We can use std::atomic_thread_fence for acquire semantics:
void f(int n)
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_relaxed)) // acquire lock
            ; // spin
        std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);                // release lock
    }
}
I expect that to be no change for x86. I'm wondering: could it make a difference on other architectures, for example ones with a yield instruction? I'm not only interested in the atomic_flag::clear / atomic_flag::test_and_set pair, I'm also interested in the atomic<uint32_t>::store / atomic<uint32_t>::load pair (see the sketch below).
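To make the store/load case concrete, here is a minimal sketch of what I mean (the names ready, payload, producer and consumer are purely illustrative): a relaxed spin on the load with the acquire fence hoisted after the loop, paired with a release store:
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> ready{0};
int payload;

void producer()
{
    payload = 42;                                         // data to publish
    ready.store(1, std::memory_order_release);            // release store
}

void consumer()
{
    while (ready.load(std::memory_order_relaxed) == 0)    // relaxed spin
        ;                                                 // spin
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
    int x = payload;                                      // now safe to read
    (void)x;
}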
Possibly changing the inner spin to a relaxed load could also make sense:
void f(int n)
{
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_acquire))  // acquire lock
            while (lock.test(std::memory_order_relaxed))
                YieldProcessor(); // spin
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);                 // release lock
    }
}
Yes, the general idea of avoiding an acquire barrier inside the failure retry path is possibly useful, although performance in the failure case is barely relevant if you're just spinning. pause or yield save power. On x86, pause also improves SMT friendliness, and avoids memory-order mis-speculation when leaving the loop after another core modified the memory location you're spinning on.
But that's why CAS has separate memory_order parameters for success and failure: a relaxed failure order could let the compiler emit a barrier only on the leave-the-loop path (a minimal sketch follows below).
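A minimal sketch of that, with an illustrative uint32_t-based lock (the names lock_word, take_lock and release_lock are not from the question): acquire only on the successful exchange, relaxed on failure, so the retry path needs no barrier:
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> lock_word{0};

void take_lock()
{
    uint32_t expected = 0;
    while (!lock_word.compare_exchange_weak(
               expected, 1,
               std::memory_order_acquire,   // success: barrier only when we leave the loop
               std::memory_order_relaxed))  // failure: no barrier on the retry path
        expected = 0; // compare_exchange wrote the observed value back into expected
}

void release_lock()
{
    lock_word.store(0, std::memory_order_release);
}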
atomic_flag::test_and_set doesn't have that option, though. Doing it manually potentially hurts ISAs like AArch64 that could have done an acquire RMW and avoided an explicit fence instruction (e.g. with ldarb).
Godbolt: Original loop with lock.test_and_set(std::memory_order_acquire):
# AArch64 gcc8.2 -O3
.L6: # do{
ldaxrb w0, [x19] # acquire load-exclusive
stxrb w1, w20, [x19] # relaxed store-exclusive
cbnz w1, .L6 # LL/SC failure retry
tst w0, 255
bne .L6 # }while(old value was != 0)
... no barrier after this
(And yes, it looks like a missed optimization that it's only testing the low 8 bits with tst instead of just cbnz w0, .L6.)
while(relaxed RMW) + std::atomic_thread_fence(std::memory_order_acquire):
.L14: # do {
ldxrb w0, [x19] # relaxed load-exclusive
stxrb w1, w20, [x19] # relaxed store-exclusive
cbnz w1, .L14 # LL/SC retry
tst w0, 255
bne .L14 # }while(old value was != 0)
dmb ishld #### Acquire fence
...
It's even worse for 32-bit ARMv8 where dmb ishld isn't available, or compilers don't use it. You'll get a dmb ish full barrier.
And with -march=armv8.1-a (single-instruction atomic swap):
.L2:
swpab w20, w0, [x19] # swap byte with acquire
tst w0, 255
bne .L2
mov x2, 19
...
vs.
.L9:
swpb w20, w0, [x19] # relaxed swap byte
tst w0, 255
bne .L9
dmb ishld # acquire barrier (load ordering)
mov x2, 19
...