Tags: c++, x86, x86-64, atomic, stdatomic

Which one performs better: load(memory_order_seq_cst) or atomic_fetch_add(0, memory_order_relaxed) on X86?


Question 1: I have a variable used primarily for counting; only its own value matters (no other data is published through it). Which of the following two approaches has better performance?

Approach 1:

Read:  aaa.fetch_add(0, memory_order_relaxed)
Write: aaa.fetch_add(1, memory_order_relaxed)

Approach 2:

Read:  aaa.load(memory_order_seq_cst)
Write: aaa.fetch_add(1, memory_order_seq_cst)
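Written out as a complete, compilable sketch (function names here are just for illustration, not part of the original question):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> aaa{0};

// Approach 1: read via a no-op RMW, write via a relaxed RMW.
// fetch_add returns the value the variable held before the add,
// so adding 0 is a "read" expressed as a read-modify-write.
uint64_t read_rmw()    { return aaa.fetch_add(0, std::memory_order_relaxed); }
void     inc_relaxed() { aaa.fetch_add(1, std::memory_order_relaxed); }

// Approach 2: read via a pure seq_cst load, write via a seq_cst RMW.
uint64_t read_load()   { return aaa.load(std::memory_order_seq_cst); }
void     inc_seq_cst() { aaa.fetch_add(1, std::memory_order_seq_cst); }
```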

Question 2: If a store to aaa from thread T1 may still be sitting in T1's store buffer, will T1's atomic RMW operation flush the store buffer?


Solution

  • Expect the pure load to be much faster: atomic RMWs are slow, pure loads are fast, and an atomic RMW also dirties the cache line. (Also, clang doesn't keep the no-op RMW as an RMW: it optimizes aaa.fetch_add(0, relaxed) into mfence + a mov load: https://godbolt.org/z/adEP7Tcx1)

    On x86, a seq_cst load requires no extra barriers, so it's as cheap as a relaxed load, apart from restrictions on compile-time reordering. Pure loads are vastly faster than any RMW (roughly 3 per clock vs. 1 per ~20 clocks in the no-contention case), and pure loads scale with multiple parallel readers. (See https://uops.info/ and compare lock xadd vs. mov r32, mem.)
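The throughput gap is easy to see with a quick single-threaded micro-benchmark; this is my own sketch (names and structure are mine), and absolute numbers vary by CPU and compiler flags, but on x86 the fetch_add(0) loop should be much slower per iteration:

```cpp
#include <atomic>
#include <chrono>

std::atomic<int> aaa{0};

// Time `iters` calls of `read` and return nanoseconds per call.
template <class F>
double ns_per_read(F read, long iters) {
    volatile int sink = 0;  // keep each read's result "used"
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        sink = read();
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Pure seq_cst load: plain mov on x86, ~fractions of a ns each.
double seq_cst_load_ns(long iters) {
    return ns_per_read([] { return aaa.load(std::memory_order_seq_cst); }, iters);
}

// No-op RMW: lock xadd on x86 (unless the compiler optimizes it away),
// a full barrier, several ns each even with no contention.
double relaxed_rmw_ns(long iters) {
    return ns_per_read([] { return aaa.fetch_add(0, std::memory_order_relaxed); }, iters);
}
```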

    Atomic RMWs are full barriers on x86; in asm there's no way to make them weaker than seq_cst. (So yes, they have to wait for the store buffer to drain.) Only .store(val, seq_cst) requires any extra barrier instruction (typically a lock add byte [rsp], 0, because that's faster than mfence on many CPUs).
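For contrast, here are the two pure-store flavours (function names are illustrative); on x86-64, compilers emit a plain mov for the release store, while the seq_cst store gets a StoreLoad barrier, typically as an xchg or a mov followed by a separate barrier instruction:

```cpp
#include <atomic>

std::atomic<int> flag{0};

// x86-64: a plain `mov` store; release ordering is free on x86.
void publish_release() { flag.store(1, std::memory_order_release); }

// x86-64: needs a StoreLoad barrier, so compilers typically emit
// `xchg` (an implicitly locked RMW), or `mov` plus a barrier.
void publish_seq_cst() { flag.store(2, std::memory_order_seq_cst); }
```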

    On some non-x86 ISAs like PowerPC, or maybe ARMv7, a relaxed RMW could perhaps be cheaper than a seq_cst load in some contexts. But that's unlikely on ARMv8 (including 32-bit mode), where seq_cst is only as strong as necessary: there's a special interaction between SC stores and SC loads, and StoreLoad reordering is allowed otherwise.

    In terms of inter-thread latency, both should perform the same; the store buffer already tries to drain itself as fast as possible to avoid filling up and stalling. (The front-end has to allocate a store-buffer entry when it issues a store uop into the back-end, so a full store buffer limits how far ahead out-of-order exec can see.) See Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? (answer: no)