Are 16-byte atomic<> variables automatically aligned on 16-byte boundaries, allowing the compiler/runtime libraries to efficiently use the x86 CMPXCHG16B instruction? Or should we, as a matter of style, always manually specify alignas(16) for all such variables?
Any decent implementation of std::atomic<> will use alignas itself to make lock cmpxchg16b efficient, if the library uses lock cmpxchg16b at all instead of a mutex for 16-byte objects.
Not all implementations do; for example, I think MSVC's standard library makes 16-byte objects fully non-lock-free, using the standard mutex fallback.
You don't need alignas(16) on atomic<T>.
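A quick way to see what your own implementation does is to compare the alignment of a plain 16-byte struct with that of its atomic<> wrapper. This is only a sketch: Pair is a hypothetical payload type made up for illustration, and the values in the comments are what I'd expect on mainstream x86-64 toolchains, not something the standard guarantees.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte payload, e.g. a pointer plus an ABA counter.
struct Pair {
    std::uint64_t ptr_bits;
    std::uint64_t counter;
};
static_assert(sizeof(Pair) == 16, "example assumes a 16-byte payload");

// Alignment of the plain struct: typically 8 on x86-64.
constexpr std::size_t plain_align = alignof(Pair);

// Alignment chosen by the library for the atomic wrapper: expected to be 16
// when the implementation wants to use lock cmpxchg16b (an assumption about
// your toolchain, so inspect the value instead of relying on it).
constexpr std::size_t atomic_align = alignof(std::atomic<Pair>);

// The wrapper holds a Pair, so it is never less aligned than the payload.
static_assert(atomic_align >= plain_align,
              "atomic<Pair> should be at least as aligned as Pair");
```

If atomic_align already comes out as 16, adding alignas(16) to a std::atomic<Pair> variable changes nothing.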
You only need manual alignment for atomics if you have a plain T object that you want to use atomic_ref on. atomic_ref<> has no mechanism to align an already existing T object. The current design exposes a required_alignment member that you should use, and it's up to you to do that for correctness. (Otherwise you get UB, which could mean tearing, or just extremely slow system-wide performance for split lock RMWs.)
// for atomic_ref<T>
alignas(std::atomic_ref<T>::required_alignment) T sometimes_atomic_var;

// alternative: often equivalent, and doesn't require checking that atomic_ref<T> is supported
alignas(std::atomic<T>) T sometimes_atomic_var;  // use the same alignment as atomic<T>
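The two declarations above can be combined into one constant that prefers required_alignment when the library provides atomic_ref and otherwise falls back to the alignment of atomic<T>. This is a sketch under assumptions: Pair, Node, kAtomicAlign and is_suitably_aligned are hypothetical names for illustration, and the feature-test macro check assumes your <atomic> header defines __cpp_lib_atomic_ref in C++20 mode.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte payload that is only sometimes accessed atomically.
struct Pair {
    std::uint64_t lo;
    std::uint64_t hi;
};

#if defined(__cpp_lib_atomic_ref)
// C++20 library with atomic_ref: use the alignment it asks for.
constexpr std::size_t kAtomicAlign = std::atomic_ref<Pair>::required_alignment;
#else
// Fallback: use the same alignment as atomic<Pair> (often equivalent).
constexpr std::size_t kAtomicAlign = alignof(std::atomic<Pair>);
#endif

// The over-aligned member raises the alignment of the enclosing struct too.
struct Node {
    alignas(kAtomicAlign) Pair sometimes_atomic;
    int other;
};

// Runtime check for objects whose alignment you don't control, e.g. memory
// handed to you by someone else, before wrapping them in atomic_ref.
inline bool is_suitably_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAtomicAlign == 0;
}
```

The runtime helper is only useful for objects you didn't declare yourself; for your own declarations, alignas makes the guarantee at compile time.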
Note that a misaligned locked instruction split across a cache line boundary would still be atomic but very slow: the atomicity guarantee for atomic RMW is not contingent on alignment. It behaves more like an actual bus lock than a local-to-this-core cache lock that just delays MESI responses. (lock cmpxchg16b itself is a special case: it requires a 16-byte-aligned memory operand and faults with #GP otherwise, so it can never be split across a cache line.)
Narrower atomics definitely need to be naturally aligned for correctness, because a pure load or pure store can compile to a plain asm load or store, where the hardware's atomicity guarantees require some alignment.
But 16-byte objects are only guaranteed atomic with lock cmpxchg16b, so .load() and .store() have to be implemented with lock cmpxchg16b. (Load with CAS(0,0) to get the old value, which either replaces 0 with itself or does nothing, and store with a CAS retry loop. This sucks but is somewhat better than a mutex. It doesn't have the read-side scalability you'd expect from a lock-free load, which is one reason GCC7 and later no longer advertise atomic<16-byte-object> as lock-free, even though they still use lock cmpxchg16b in the libatomic functions they call instead of inlining lock cmpxchg16b.)
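The CAS-only load and store described above can be sketched as follows. To keep the example runnable without libatomic or -mcx16, it uses a 64-bit std::atomic as a stand-in; a real 16-byte implementation would do the same thing with lock cmpxchg16b. The function names load_via_cas, store_via_cas and demo are hypothetical, made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Load using only compare-and-swap: try to replace 0 with 0.
inline std::uint64_t load_via_cas(std::atomic<std::uint64_t>& a) {
    std::uint64_t expected = 0;
    // If a == 0, the CAS succeeds and rewrites 0 with 0 (no visible change).
    // Otherwise it fails and copies the current value into 'expected'.
    // Either way 'expected' ends up holding the value that was in a.
    a.compare_exchange_strong(expected, 0);
    return expected;
}

// Store using only compare-and-swap: retry until our value goes in.
inline void store_via_cas(std::atomic<std::uint64_t>& a, std::uint64_t desired) {
    std::uint64_t expected = load_via_cas(a);
    while (!a.compare_exchange_weak(expected, desired)) {
        // A failed CAS refreshed 'expected' with the current value; try again.
    }
}

// Small self-check exercising both helpers on a single thread.
inline void demo() {
    std::atomic<std::uint64_t> a{42};
    assert(load_via_cas(a) == 42);  // non-zero value: CAS fails, value read
    store_via_cas(a, 0);
    assert(load_via_cas(a) == 0);   // zero value: CAS succeeds, still 0
}
```

Note that even the "load" is a locked RMW that needs exclusive ownership of the cache line, so concurrent readers contend with each other just like writers do. That is the read-side scalability problem mentioned above.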