It turns out that all(?) compilers treat `std::atomic::load(std::memory_order_relaxed)` as a volatile load (via `__iso_volatile_load64`, etc.).
They don't optimize or reorder it at all. Even discarding the loaded value still generates a load instruction, because compilers treat it like it can have side effects.
So, relaxed loads are suboptimal. With that said...
Assume `p` points to a monotonically-increasing 8-byte counter in shared memory that is only written to outside my process. My program only reads from this address.
I want to read this counter in a manner such that:

1. Loads are atomic (no tearing).
2. Ordering is preserved for this counter (so that `x = *p; y = *p;` implies `x <= y`).
3. Loads are not treated as opaque/optimization barriers (except for #2 above).
In particular, the intent here is that the compiler performs as many of the optimizations as it can, like it would for normal memory accesses, e.g. useless loads (like `(void)*p;`) get discarded, other instructions get reordered freely around this memory access, etc.
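For concreteness, this is roughly the usage I have in mind (`read_counter()` is just a placeholder for whatever mechanism answers this question):

```cpp
#include <cstdint>

// Hypothetical: read_counter() stands in for whatever mechanism this question asks for.
uint64_t read_counter();

void example() {
    uint64_t a = read_counter();   // must be a single, untorn 8-byte load
    uint64_t b = read_counter();   // must see the same or a later value, so a <= b
    (void)read_counter();          // should be free to be optimized away entirely
    (void)a; (void)b;
}
```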
Is there any way to achieve this on either MSVC or Clang other than with volatile loads?
(Implementation-specific hacks/intrinsics/etc. are OK, as long as those specific implementations never treat it like undefined behavior, so there is no risk of wrong codegen.)
`const std::atomic<uint64_t> *p` or `std::atomic_ref<>` with `std::memory_order_relaxed` gives you most of what you want, except for common-subexpression elimination (CSE). In future compilers you might even get a limited amount of that, or at least optimizing away of unused loads. ISO C++ on paper guarantees just barely enough for your use case.
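A minimal sketch of how that could look, under the assumption (mine, not from the question) that the counter is mapped into the process as a plain, suitably aligned `uint64_t`; this needs C++20 for `std::atomic_ref`:

```cpp
#include <atomic>
#include <cstdint>

// Atomic access to an object that wasn't declared std::atomic, which fits a raw
// shared-memory mapping. The object must satisfy
// std::atomic_ref<uint64_t>::required_alignment.
uint64_t read_counter(uint64_t *counter) {
    std::atomic_ref<uint64_t> ref(*counter);
    return ref.load(std::memory_order_relaxed);   // untorn, read-read coherent
}

// Or, if you're willing to treat the mapping as the atomic object itself
// (same size/representation as uint64_t on mainstream ABIs):
uint64_t read_counter_atomic(const std::atomic<uint64_t> *p) {
    return p->load(std::memory_order_relaxed);
}
```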
I don't know of anything that's weaker than that but still safe. Making it plain (non-atomic/non-volatile) wouldn't give you read-read coherence. Even if you write `int x = *p;` in the source, some (maybe not all) later uses of `x` might actually reload from `*p`. See the "invented loads" section of *Who's afraid of a big bad optimizing compiler?* on LWN. This might happen for some later uses of `x` but not others, making the variable appear to change value. Or it might happen for `x` but not `y`, allowing a violation of `x <= y`.
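To illustrate the hazard with a plain load (my sketch of the kind of code the LWN article warns about; `threshold` and `use()` are made-up names):

```cpp
#include <cstdint>

extern uint64_t threshold;   // hypothetical
void use(uint64_t);          // hypothetical

void poll(const uint64_t *p) {   // plain non-atomic pointer, for illustration only
    uint64_t x = *p;             // the compiler may keep x in a register...
    if (x > threshold)           // ...or invent a reload of *p for this compare...
        use(x);                  // ...and/or another reload for this use,
                                 // so different "uses of x" can see different values.
}
```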
Perhaps you could use GNU C inline asm like `int x = *p; asm("" ::: "memory");` to tell the compiler `*p` might have changed. Or maybe something less harmful to optimization, like `asm("" : "+g"(*p))`, to tell it only to forget about the value of `*p` without being a compiler barrier to all memory reordering. But that's still going to prevent CSE of multiple loads, since you're still manually telling the compiler where to forget about things.
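A sketch of those two variants (GNU C/C++ only, i.e. GCC and Clang; `p` here is a plain `uint64_t *`, and note this still doesn't make the loads themselves formally atomic):

```cpp
#include <cstdint>

uint64_t three_reads(uint64_t *p) {
    uint64_t x = *p;
    asm("" ::: "memory");    // full compiler barrier: forget everything cached
                             // about memory, so the next read really reloads
    uint64_t y = *p;
    asm("" : "+g"(*p));      // narrower: forget only the value of *p, without
                             // acting as a barrier for unrelated memory ops
    uint64_t z = *p;         // also forced to reload
    return x + y + z;
}
```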
Plus, it's hypothetically possible that the compiler might do `x = *p` non-atomically if it's not `volatile` or `atomic`, depending on surrounding code. *Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes* shows an example of a 64-bit store on AArch64 that GCC chooses to compile with `stp` of the same value for both halves of the pair, which isn't guaranteed atomic until ARMv8.4 or so. So using non-atomic types and relying on memory barriers is the worst of both worlds, and isn't backed by any compiler-specific guarantees; it's still data-race UB in MSVC and GNU C++.
**`std::atomic<>` with `relaxed` meets your correctness requirements.** Lock-free `std::atomic` loads always compile to something that's atomic in hardware, and `atomic<uint64_t>` should be lock-free on all mainstream x86 compilers, even in 32-bit mode. (Not 100% sure about MSVC, but GCC and Clang know how to use SSE2 `movq` for 8-byte atomic loads in 32-bit mode.)
Even `relaxed` atomics have read-read coherence ([intro.races]/16): subsequent reads will see the same or a later value in the modification order. This prevents compile-time reordering, and coherent caches + hardware guarantees make this happen for free (without any extra barrier instructions) even on non-x86 ISAs.
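Applied to the counter, that coherence rule is what makes the `x <= y` requirement hold (sketch, assuming `p` points at the shared counter and it never wraps):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

void check(const std::atomic<uint64_t> *p) {
    uint64_t x = p->load(std::memory_order_relaxed);
    uint64_t y = p->load(std::memory_order_relaxed);
    // Read-read coherence: y sees the same value as x or a later one in the
    // modification order; the writer only ever increases the counter, so:
    assert(x <= y);
}
```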
Compilers can and do reorder other memory ops on other vars around `relaxed` atomic load/store.
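For example (my illustration; `plain_global` is a made-up non-atomic variable), the optimizer is free to move and combine the surrounding plain accesses across the relaxed load:

```cpp
#include <atomic>
#include <cstdint>

extern int plain_global;   // hypothetical non-atomic variable

int around(const std::atomic<uint64_t> *p) {
    int a = plain_global;                                // may sink below the load
    uint64_t c = p->load(std::memory_order_relaxed);
    int b = plain_global;                                // may be CSE'd with `a`
    return a + b + static_cast<int>(c);
}
```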
Same goes for `volatile` with GCC/Clang (and MSVC with `/volatile:iso` to make sure it's not treating it as `acquire`/`release` even when compiling for x86). But `std::atomic` with `relaxed` portably expresses exactly the semantics you want, so future compilers might optimize it better. And `volatile` ops can't reorder at compile time with other `volatile` ops regardless of which location, unlike `relaxed` atomics, which can.
`std::atomic` loads even on non-x86 are just plain asm loads, at least for types that fit in a single register, like `uint64_t` on x86-64.
Current GCC and Clang do have some minor missed optimizations, though, like they won't use a `std::atomic<>.load` as a memory source operand for another instruction (https://godbolt.org/z/sf8WcG7qs):
```cpp
#include <atomic>
using T = std::atomic<long>;

long read(T &a, long dummy){
    return dummy + a.load(std::memory_order_relaxed) + a.load(std::memory_order_relaxed);
}
```

```asm
# GCC13.2 -O3 ; clang is equivalent
read(std::atomic<long long>&, long):
        mov     rax, QWORD PTR [rdi]
        mov     rdx, QWORD PTR [rdi]
        add     rax, rsi
        add     rax, rdx
        ret
```
vs.
```cpp
long read_plain(long *p, long *q, long dummy){
    return dummy + *p + *q;
}
```

```asm
## GCC13 -O3 ; clang is similar
read_plain(long*, long*, long):
        mov     rax, QWORD PTR [rsi]
        add     rdx, QWORD PTR [rdi]    ### memory source operand
        add     rax, rdx
        ret
```
It would be nice if they fixed stuff like that; there's no reason `a.store(1 + a.load(relaxed), relaxed);` shouldn't compile to `add qword ptr [rdi], 1` (without `lock`, since I didn't use `a.fetch_add`), but GCC will do a separate load / inc / store, like with `volatile` but unlike with plain `long`.

Clang actually will use `inc qword ptr [rdi]` for that atomic load/add/store, so it's only GCC and MSVC with missed optimizations in that case.
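Spelled out as a compilable function (just the snippet above with the memory orders written out, so it can be pasted into Godbolt):

```cpp
#include <atomic>

// Non-atomic increment built from a relaxed load plus a relaxed store.
// Per the discussion above, Clang turns this into a memory-destination
// increment without a lock prefix; current GCC and MSVC emit a separate
// load / add / store sequence.
void bump(std::atomic<long> &a) {
    a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_relaxed);
}
```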
Clang will also use `volatile` loads as memory source operands for `add`, like:
```cpp
long read_volatile(volatile long *p, volatile long *q, long dummy){
    return dummy + *p + *q;
}
```

```asm
# clang 17 -O3
read_volatile(long volatile*, long volatile*, long):
        mov     rax, rdx
        add     rax, qword ptr [rdi]    # same asm it makes for non-volatile
        add     rax, qword ptr [rsi]    # unlike GCC which makes asm like for atomic
        ret
```
And yes, this asm is arguably worse than GCC's plain case: on CPUs without mov-elimination it's more back-end uops (2 each from the `add`s, plus one from the `mov` if it isn't eliminated), and the `mov` would be on the critical path for latency from `dummy` to the result. But that choice is unrelated to `volatile` vs. `atomic`.
So you might consider using `volatile` for now, especially if you're using Clang, since you're compiling for a CISC ISA (x86) where folding loads into memory source operands for other instructions saves front-end bandwidth.
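If you do go the `volatile` route, a minimal helper might look like this (my sketch; the atomicity isn't guaranteed by ISO C++, but as discussed above, GCC, Clang, and MSVC with `/volatile:iso` compile an aligned volatile 8-byte load on x86-64 to a single `mov`):

```cpp
#include <cstdint>

// Volatile read of the shared counter: relies on the implementation-specific
// behavior discussed above, not on any formal atomicity guarantee.
static inline uint64_t read_counter(const volatile uint64_t *p) {
    return *p;   // single aligned 8-byte load on x86-64
}
```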