How is atomic_flag
is implemented? It feels to me that on x86-64 it is equivalent to atomic_bool
anyway, but it is just a guess. Might the x86-64 implementation be any different from arm or x86?
Yeah, on normal CPUs where atomic<bool>
and atomic<int>
are also lock-free, it's pretty much like atomic<bool>
, using the same instructions. (x86 and x86-64 have the same set of atomic operations available.)
You might think that it would always use x86 lock bts
or lock btr
to set / reset (clear) a single bit, but it can be more efficient to do other things (especially for a function that returns a bool instead of branching on it). The object is a whole byte so you can just store or exchange the whole byte. (And if the ABI guarantees that the value is always 0
or 1
, you don't have to booleanize it before returning the result as a bool
)
GCC and clang compile test_and_set
to a byte exchange, and clear to a byte store of 0
. We get (nearly) identical asm for atomic_flag
test_and_set
as f.exchange(true);
#include <atomic>
bool TAS(std::atomic_flag &f) {
return f.test_and_set();
}
bool TAS_bool(std::atomic<bool> &f) {
return f.exchange(true);
}
void clear(std::atomic_flag &f) {
//f = 0; // deleted
f.clear();
}
void clear_relaxed(std::atomic_flag &f) {
f.clear(std::memory_order_relaxed);
}
void bool_clear(std::atomic<bool> &f) {
f = false; // deleted
}
On Godbolt for x86-64 with gcc and clang, and for ARMv7 and AArch64.
## GCC9.2 -O3 for x86-64
TAS(std::atomic_flag&):
mov eax, 1
xchg al, BYTE PTR [rdi]
ret
TAS_bool(std::atomic<bool>&):
mov eax, 1
xchg al, BYTE PTR [rdi]
test al, al
setne al # missed optimization, doesn't need to booleanize to 0/1
ret
clear(std::atomic_flag&):
mov BYTE PTR [rdi], 0
mfence # memory fence to drain store buffer before future loads
ret
clear_relaxed(std::atomic_flag&):
mov BYTE PTR [rdi], 0 # x86 stores are already mo_release, no barrier
ret
bool_clear(std::atomic<bool>&):
mov BYTE PTR [rdi], 0
mfence
ret
Note that xchg
is also an efficient way to do a seq_cst
store on x86-64, usually more efficient than the mov
+ mfence
that gcc uses. Clang uses xchg
for all of these (except the relaxed store).
Amusingly, clang re-booleanizes to 0/1 after the xchg in atomic_flag.test_and_set()
, but GCC instead does it after atomic<bool>
. clang does a weird and al,1
in TAS_bool, which would treat values like 2
as false. It seems totally pointless; the ABI guarantees that a bool
in memory is always stored as a 0
or 1
byte.
For ARM, we have ldrexb
/ strexb
exchange retry loops, or just strb
+ dmb ish
for the pure store. Or AArch64 can use stlrb wzr, [x0]
for clear
or assign-false to do a sequential-release store (of the zero-register) without needing a barrier.