Tags: c++, visual-c++, x86, clang, atomic

Fastest way to perform an atomic read in this *very* specific situation?


Background

It turns out that all(?) compilers treat std::atomic::load(std::memory_order_relaxed) as a volatile load (via __iso_volatile_load64, etc.).
They don't optimize or reorder it at all. Even discarding the loaded value still generates a load instruction, because compilers treat it as if it could have side effects.

So, relaxed loads are suboptimal. With that said...


Question (x86)

Assume p points to a monotonically-increasing 8-byte counter in shared memory that is only written to outside my process. My program only reads from this address.

I want to read this counter in a manner such that:

  1. Loads are atomic (no tearing)

  2. Ordering is preserved for this counter (so that x = *p; y = *p; implies x <= y)

  3. Loads are not treated as opaque/optimization barriers (except for #2 above)

In particular, the intent here is that the compiler performs as many of the optimizations as it would on normal memory accesses, e.g.: useless loads (like (void)*p;) get discarded, other instructions get reordered freely around this memory access, etc.

Is there any way to achieve this on either MSVC or Clang other than with volatile loads?

(Implementation-specific hacks/intrinsics/etc. are OK, as long as those specific implementations never treat it like undefined behavior, so there is no risk of wrong codegen.)


Solution

  • const std::atomic<uint64_t> *p or std::atomic_ref<> with std::memory_order_relaxed gives you most of what you want, except for common-subexpression elimination (CSE); a minimal sketch follows below. Future compilers might even give you a limited amount of that, or at least optimize away unused loads. ISO C++ on paper guarantees just barely enough for your use case.
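
    For concreteness, a sketch of that approach (the names are mine; it assumes the counter lives at an 8-byte-aligned address inside your shared-memory mapping, and that your implementation is happy treating that storage as std::atomic<uint64_t>):

    #include <atomic>
    #include <cstdint>
    
    // Read the externally-written 8-byte counter with a relaxed atomic load.
    // The address must be 8-byte aligned so the load is a single instruction.
    uint64_t read_counter(const std::atomic<uint64_t> *counter) {
        return counter->load(std::memory_order_relaxed);   // atomic, read-read coherent
    }
    
    // C++20 alternative that avoids declaring the shared storage as std::atomic:
    uint64_t read_counter_ref(uint64_t *counter) {
        return std::atomic_ref<uint64_t>(*counter).load(std::memory_order_relaxed);
    }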

    I don't know of anything that's weaker than that but still safe. Making it plain (non-atomic/non-volatile) wouldn't give you read-read coherence. Even if you write int x = *p; in the source, some (maybe not all) later uses of x might actually reload from *p; see the "invented loads" section of Who's afraid of a big bad optimizing compiler? on LWN. That could make the variable appear to change value between uses, or reload x but not y, allowing a violation of x <= y.

    You could perhaps use GNU C inline asm like int x = *p; asm("" ::: "memory"); to tell the compiler *p might have changed, or something less optimization-hurting like asm("" : "+g"(*p)) to make it forget only the value of *p without being a compiler barrier to all memory reordering (sketched below). But that's still going to prevent CSE of multiple loads, since you're still manually telling the compiler where to forget about things.
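
    As a sketch of that second idea (GNU C / Clang only; it uses a "+m" constraint so the operand is required to stay in memory, and it inherits all the caveats about plain non-atomic loads in the next paragraph):

    #include <cstdint>
    
    // Sketch only: a plain load followed by an empty asm statement.  The "+m"
    // operand tells the compiler the asm may have modified *p, so it must forget
    // any cached value of *p, without acting as a barrier for other memory.
    inline uint64_t read_and_forget(uint64_t *p) {
        uint64_t x = *p;        // plain load: not guaranteed atomic (see below)
        asm("" : "+m"(*p));     // forget what we know about *p only
        return x;
    }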

    Plus, it's hypothetically possible that x = *p is done non-atomically if it's not volatile or atomic, depending on surrounding code. The question "Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes" shows an example of a 64-bit store on AArch64 that GCC chooses to compile with stp of the same value for both halves of the pair, which isn't guaranteed atomic until ARMv8.4 or so. So using non-atomic types and relying on memory barriers is the worst of both worlds, and isn't guaranteed to work by any compiler-specific guarantees; it is still data-race UB in MSVC and GNU C++.


    std::atomic<> with relaxed meets your correctness requirements

    1. Lock-free std::atomic loads always compile to something that's atomic in hardware, and atomic<uint64_t> should be lock-free on all mainstream x86 compilers even in 32-bit mode. (Not 100% sure about MSVC, but GCC and Clang know how to use SSE2 movq for 8-byte atomic loads in 32-bit mode.)

    2. Even relaxed atomics have read-read coherence ([intro.races]/16): subsequent reads will see the same or a later value in the modification order. This prevents compile-time reordering, and coherent caches + hardware guarantees make this happen for free (without any extra barrier instructions) even on non-x86 ISAs. (Points 1 and 2 are sketched in code after this list.)

    3. Compilers can and do reorder other memory ops on other vars around relaxed atomic load/store.

      Same goes for volatile with GCC/clang (and MSVC with /volatile:iso to make sure it's not treating it as acquire/release even when compiling for x86). But std::atomic with relaxed portably expresses exactly the semantics you want, so future compilers might optimize better. And volatile ops can't reorder at compile time with other volatile ops, regardless of location, unlike relaxed atomics, which can.
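
    A short sketch of points 1 and 2 in code (the function and names are illustrative):

    #include <atomic>
    #include <cassert>
    #include <cstdint>
    
    // Holds at compile time on x86-64; 32-bit builds may only report
    // lock-freedom at runtime via is_lock_free().
    static_assert(std::atomic<uint64_t>::is_always_lock_free);
    
    void check_monotonic(const std::atomic<uint64_t> *ctr) {
        uint64_t x = ctr->load(std::memory_order_relaxed);
        uint64_t y = ctr->load(std::memory_order_relaxed);
        // Read-read coherence: y can't be earlier than x in the modification
        // order, so a monotonically-increasing counter gives x <= y.
        assert(x <= y);
    }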


    std::atomic loads even on non-x86 are just plain asm loads, at least for types that fit in a single register like uint64_t on x86-64.

    Current GCC and Clang do have some minor missed optimizations, though: for example, they won't use a std::atomic<> load() as a memory source operand for another instruction (https://godbolt.org/z/sf8WcG7qs):

    #include <atomic>
    
    using T = std::atomic<long>;
    long read(T &a, long dummy){
        return dummy + a.load(std::memory_order_relaxed) + a.load(std::memory_order_relaxed);
    }
    
    # GCC13.2 -O3 ;  clang is equivalent
    read(std::atomic<long>&, long):
            mov     rax, QWORD PTR [rdi]
            mov     rdx, QWORD PTR [rdi]
            add     rax, rsi
            add     rax, rdx
            ret
    

    vs.

    long read_plain(long *p, long *q, long dummy){
        return dummy + *p + *q;
    }
    
    ## GCC13 -O3  ; clang is similar
    read_plain(long*, long*, long):
            mov     rax, QWORD PTR [rsi]
            add     rdx, QWORD PTR [rdi]   ### memory source operand
            add     rax, rdx
            ret
    

    It would be nice if they fixed stuff like that; there's no reason a.store(1 + a.load(relaxed), relaxed); shouldn't compile to add qword ptr [rdi], 1 (without lock since I didn't use a.fetch_add), but GCC will do a separate load / inc / store, like with volatile but unlike with plain long.

    Clang actually will use inc qword ptr [rdi] for that atomic load/add/store, so it's only GCC and MSVC with missed optimizations in that case.
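
    For reference, that snippet as a compilable function (per the discussion above, Clang compiles it to a single memory-destination increment, while current GCC and MSVC don't):

    #include <atomic>
    
    // Non-RMW increment built from a relaxed load plus a relaxed store
    // (not a fetch_add, so no lock prefix is needed or implied).
    void bump(std::atomic<long> &a) {
        a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_relaxed);
    }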

    Clang will also use volatile loads as memory source operands for add, like this:

    long read_volatile(volatile long *p, volatile long *q, long dummy){
        return dummy + *p + *q;
    }
    
    # clang 17 -O3
    read_volatile(long volatile*, long volatile*, long):
            mov     rax, rdx
            add     rax, qword ptr [rdi]  # same asm it makes for non-volatile
            add     rax, qword ptr [rsi]  # unlike GCC which makes asm like for atomic
            ret
    

    And yes, this asm is arguably worse than GCC's plain case: on CPUs without mov-elimination it's more back-end uops (2 from each memory-source add, plus one from the mov when it isn't eliminated), and the mov would be on the critical path for latency from dummy to the result. But that choice is unrelated to volatile vs. atomic.

    So you might consider using volatile for now, especially if you're using Clang, since you're compiling for a CISC ISA (x86) where folding loads into memory source operands for other instructions saves front-end bandwidth.
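
    If you go that route, a minimal sketch (x86-64 only; the atomicity of the aligned 8-byte load is a hardware guarantee there, not something ISO C++ promises for volatile, and on MSVC you'd want /volatile:iso as noted above):

    #include <cstdint>
    
    // Volatile read of the externally-written counter.  On x86-64 this compiles
    // to a single 8-byte mov, and Clang can fold it into a memory source operand
    // of another instruction (current GCC won't).
    inline uint64_t read_counter_volatile(const volatile uint64_t *p) {
        return *p;
    }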