Tags: c++, x86, x86-64, atomic, stdatomic

Which one performs better: load(memory_order_seq_cst) or atomic_fetch_add(0, memory_order_relaxed) on X86?


Question 1: I have a variable used primarily for counting; only its own value matters (no other data is published through it). Which of the following two approaches has better performance?

Approach 1:

Read:  aaa.fetch_add(0, memory_order_relaxed)
Write: aaa.fetch_add(1, memory_order_relaxed)

Approach 2:

Read:  aaa.load(memory_order_seq_cst)
Write: aaa.fetch_add(1, memory_order_seq_cst)
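Written out as a complete, compilable sketch (function names here are just for illustration, not part of the original question):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> aaa{0};

// Approach 1: read via a no-op RMW, write via a relaxed RMW.
// fetch_add returns the value the variable held before the add,
// so adding 0 is a "read" expressed as a read-modify-write.
uint64_t read_rmw()    { return aaa.fetch_add(0, std::memory_order_relaxed); }
void     inc_relaxed() { aaa.fetch_add(1, std::memory_order_relaxed); }

// Approach 2: read via a pure seq_cst load, write via a seq_cst RMW.
uint64_t read_load()   { return aaa.load(std::memory_order_seq_cst); }
void     inc_seq_cst() { aaa.fetch_add(1, std::memory_order_seq_cst); }
```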

Question 2: If a store to aaa from thread T1 may still be sitting in T1's store buffer, will T1's atomic RMW operation flush the store buffer?


Solution

  • Expect the pure load to be much faster: atomic RMWs are slow, pure loads are fast, and an atomic RMW also dirties the cache line. (Also, clang doesn't keep the no-op RMW as an RMW: it optimizes aaa.fetch_add(0, relaxed) into mfence + a mov load: https://godbolt.org/z/adEP7Tcx1)

    On x86, a seq_cst load requires no extra barriers, so it's as cheap as a relaxed load, apart from restrictions on compile-time reordering. Pure loads are vastly faster than any RMW (roughly 3 per clock vs. 1 per ~20 clocks in the no-contention case), and pure loads scale with multiple parallel readers. (See https://uops.info/ and compare lock xadd vs. mov r32, mem.)
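The throughput gap is easy to see with a quick single-threaded micro-benchmark; this is my own sketch (names and structure are mine), and absolute numbers vary by CPU and compiler flags, but on x86 the fetch_add(0) loop should be much slower per iteration:

```cpp
#include <atomic>
#include <chrono>

std::atomic<int> aaa{0};

// Time `iters` calls of `read` and return nanoseconds per call.
template <class F>
double ns_per_read(F read, long iters) {
    volatile int sink = 0;  // keep each read's result "used"
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        sink = read();
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Pure seq_cst load: plain mov on x86, ~fractions of a ns each.
double seq_cst_load_ns(long iters) {
    return ns_per_read([] { return aaa.load(std::memory_order_seq_cst); }, iters);
}

// No-op RMW: lock xadd on x86 (unless the compiler optimizes it away),
// a full barrier, several ns each even with no contention.
double relaxed_rmw_ns(long iters) {
    return ns_per_read([] { return aaa.fetch_add(0, std::memory_order_relaxed); }, iters);
}
```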

    Atomic RMWs are full barriers on x86; in asm there's no way to make them weaker than seq_cst. (So yes, they have to wait for the store buffer to drain.) Only .store(val, seq_cst) requires any extra barrier instruction (typically a lock add byte [rsp], 0, because that's faster than mfence on many CPUs).
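For contrast, here are the two pure-store flavours (function names are illustrative); on x86-64, compilers emit a plain mov for the release store, while the seq_cst store gets a StoreLoad barrier, typically as an xchg or a mov followed by a separate barrier instruction:

```cpp
#include <atomic>

std::atomic<int> flag{0};

// x86-64: a plain `mov` store; release ordering is free on x86.
void publish_release() { flag.store(1, std::memory_order_release); }

// x86-64: needs a StoreLoad barrier, so compilers typically emit
// `xchg` (an implicitly locked RMW), or `mov` plus a barrier.
void publish_seq_cst() { flag.store(2, std::memory_order_seq_cst); }
```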

    On some non-x86 ISAs like PowerPC, or maybe ARMv7, a relaxed RMW could perhaps be cheaper than a seq_cst load in some contexts. But that's unlikely on ARMv8 (including 32-bit mode), where seq_cst is only as strong as necessary: there's a special interaction between SC stores and SC loads, and StoreLoad reordering is allowed otherwise.

    In terms of inter-thread latency, both should perform the same; the store buffer already tries to drain itself as fast as possible to avoid filling up and stalling. (The front-end has to allocate a store-buffer entry when it issues a store uop into the back-end, so a full store buffer limits how far ahead out-of-order exec can see.) See Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? (answer: no)