c++, x86-64, memory-alignment, stdatomic, compare-and-swap

C++11: are 16-byte atomic<> variables automatically aligned on 16-byte boundaries allowing CMPXCHG16B instruction?


Are 16-byte atomic<> variables automatically aligned on 16-byte boundaries allowing the compiler/runtime libraries to efficiently use the x86 CMPXCHG16B instruction? Or should we as a matter of style always manually specify alignas(16) for all such variables?


Solution

  • Any decent implementation of std::atomic<> will use alignas itself to make lock cmpxchg16b efficient, if the library uses lock cmpxchg16b at all instead of a mutex for 16-byte objects.

    Not all implementations do; for example, I think MSVC's standard library makes 16-byte objects fully non-lock-free, using its standard mutex fallback.

    You don't need alignas(16) on atomic<T>.
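
    You can check what your own implementation does. Here's a minimal sketch (the Pair struct is just an illustrative 16-byte type; the printed values depend on compiler and standard library, and the lock-freedom queries need C++17). On GCC/Clang for x86-64, alignof(std::atomic<Pair>) is 16 even though alignof(Pair) is 8, and GCC may report it as not lock-free even while using lock cmpxchg16b under the hood, as discussed at the end of this answer. You may need to link with -latomic on GCC.

        #include <atomic>
        #include <cstdio>

        struct Pair { long a, b; };   // plain 16-byte, trivially-copyable type

        int main() {
            std::atomic<Pair> x{};
            // The atomic wrapper over-aligns the object itself; no manual alignas(16) needed.
            std::printf("alignof(Pair)              = %zu\n", alignof(Pair));
            std::printf("alignof(std::atomic<Pair>) = %zu\n", alignof(std::atomic<Pair>));
            // C++17: compile-time and run-time lock-freedom reporting.
            std::printf("is_always_lock_free = %d\n", (int)std::atomic<Pair>::is_always_lock_free);
            std::printf("is_lock_free()      = %d\n", (int)x.is_lock_free());
        }
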

    You only need manual alignment for atomics if you have a plain T object that you want to use atomic_ref on: atomic_ref<> has no mechanism to align an already-existing T object, so it's up to you to declare it with sufficient alignment for correctness. The current design (C++20) exposes a required_alignment member you should use for this, as in the snippet below. (Otherwise you get UB, which could mean tearing, or just extremely slow system-wide performance from split-lock RMWs.)

        // for atomic_ref<T>: use its required_alignment
        alignas(std::atomic_ref<T>::required_alignment) T sometimes_atomic_var;

        // often equivalent, and doesn't require checking that atomic_ref<T> is supported:
        // use the same alignment as atomic<T>
        alignas(std::atomic<T>) T sometimes_atomic_var;

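    For context, here's a hedged sketch of how such an over-aligned plain object might then be accessed through C++20 atomic_ref; the Node struct and bump_tag function are made up for illustration:

        #include <atomic>
        #include <cstdint>

        struct Node { void* ptr; std::uintptr_t tag; };   // illustrative 16-byte payload

        // Align the plain object so wrapping it in atomic_ref<Node> is legal.
        alignas(std::atomic_ref<Node>::required_alignment) Node shared_node{};

        void bump_tag() {
            std::atomic_ref<Node> ref(shared_node);   // non-owning atomic view
            Node expected = ref.load();
            Node desired;
            do {
                desired = expected;
                desired.tag += 1;
            } while (!ref.compare_exchange_weak(expected, desired));
        }
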

    Note that a misaligned lock cmpxchg16b split across a cache-line boundary would still be atomic, but very, very slow (same as for any locked instruction: the atomicity guarantee for atomic RMWs is not contingent on alignment). It's more like an actual bus lock, instead of just a cache lock local to this core that delays MESI responses.

    Narrower atomics definitely need to be naturally aligned for correctness, because pure-load and pure-store compile to plain asm load and store instructions, and the hardware only guarantees those are atomic when they're suitably aligned.
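
    To make that concrete, here's a minimal sketch; the comments describe typical x86-64 code generation, not something the C++ standard itself guarantees:

        #include <atomic>

        std::atomic<int> counter{0};   // naturally aligned 4-byte atomic

        int read_counter() {
            // Typically compiles to a plain 4-byte mov: aligned loads up to
            // 8 bytes are guaranteed atomic by the hardware, so no lock
            // prefix is needed. That hardware guarantee depends on alignment.
            return counter.load();
        }

        void set_counter(int v) {
            // A seq_cst store typically becomes xchg (or mov + mfence);
            // a release store would be a plain mov.
            counter.store(v);
        }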

    But 16-byte objects are only guaranteed atomic via lock cmpxchg16b, so .load() and .store() have to be implemented with lock cmpxchg16b as well. (A load is a CAS(0,0): either it replaces 0 with itself or it does nothing, and either way it returns the old value; a store is a CAS retry loop.) This sucks, but it's somewhat better than a mutex. It doesn't have the read-side scalability you'd expect from a lock-free load, which is one reason GCC 7 and later no longer advertise atomic<16-byte-object> as lock-free, even though they still use lock cmpxchg16b inside the libatomic functions they call instead of inlining it.
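
    To illustrate that last point, here's a rough sketch in portable C++ of load and store expressed in terms of CAS. This is just the idea, not the actual code GCC's libatomic uses; Big, load_via_cas, and store_via_cas are names invented for the example:

        #include <atomic>

        struct Big { long a, b; };   // illustrative 16-byte type

        // Load as CAS(guess, guess): if the object equals the guess it is
        // "replaced" by the same value, otherwise nothing is stored; either
        // way 'expected' is updated to the value that was actually there.
        Big load_via_cas(std::atomic<Big>& obj) {
            Big expected{};   // arbitrary guess (the "0" in CAS(0,0))
            obj.compare_exchange_strong(expected, expected);
            return expected;
        }

        // Store as a CAS retry loop: keep trying to swap in 'desired'
        // until we win, no matter what value is currently there.
        void store_via_cas(std::atomic<Big>& obj, Big desired) {
            Big expected = obj.load();   // itself a CAS under the hood here
            while (!obj.compare_exchange_weak(expected, desired)) {
                // 'expected' now holds the current value; just retry.
            }
        }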