Tags: c++, x86, inline-assembly, micro-optimization, memory-barriers

An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra performance penalties


A follow-up question to Why does this `std::atomic_thread_fence` work.

Since a dummy interlocked operation is better than `_mm_mfence`, and there are quite a few ways to implement one, which interlocked operation should be used, and on what data?

Assume inline assembly that is not aware of the surrounding context, but can tell the compiler which registers it clobbers.
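For context, here is a minimal sketch of the portable construct being replaced (the function name `full_barrier` is my own); compilers have commonly lowered this fence to `mfence` on x86-64, which is what the question tries to improve on:

```c++
#include <atomic>

void full_barrier() {
    // The portable seq_cst fence under discussion; often lowered to
    // mfence on x86-64, which this question aims to beat.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```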


Solution

  • Short answer for now, without going into too much detail about why; see in particular the discussion in comments on the linked question.

    `lock orb $0, -1(%rsp)` is probably a good bet to avoid lengthening dependency chains for local vars that get spilled/reloaded. See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for benchmarks. On Windows x64 (no red zone), that space should be unused except by future `call` or `push` instructions. (A sketch of wrapping this in inline asm follows at the end of this answer.)

    Store forwarding into the load side of a locked operation might be a factor (if that space was recently used), so keeping the locked operation narrow is good. But since it's a full barrier, I don't expect there can be any store forwarding from its output to anything else, so, unlike the usual case for narrow stores, a 1-byte `lock orb` doesn't have that downside.

    `mfence` is pretty crap compared to a locked operation on a hot line of stack space, even on Haswell; probably worse on Skylake, where it even blocks out-of-order exec. (It's also bad on AMD compared to `lock add`.)
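    As a rough illustration, here is one way to wrap this in GCC/Clang extended inline asm on x86-64 (the function name `seq_cst_fence` is my own; this is a sketch, not a drop-in replacement for the standard fence). The `"memory"` clobber also makes it a compiler barrier, matching the question's "not aware of surrounding context" assumption:

    ```c++
    // Hypothetical wrapper: full barrier via a dummy locked RMW on the
    // byte just below RSP, assuming GCC/Clang extended asm on x86-64.
    static inline void seq_cst_fence() {
        // OR with 0 leaves the byte's value unchanged, so this is safe
        // even if that byte is in use (e.g. in the System V red zone);
        // the lock prefix is what provides the full-barrier semantics.
        asm volatile("lock orb $0, -1(%%rsp)"
                     ::: "memory",  // also a compiler barrier
                         "cc");     // OR writes EFLAGS
    }
    ```

    For comparison, `_mm_mfence()` (from `<immintrin.h>`) would emit the `mfence` instruction whose costs are discussed above.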