A follow-up question to Why does this `std::atomic_thread_fence` work:
Given that a dummy interlocked operation is better than `_mm_mfence`, and that there are quite a few ways to implement one, which interlocked operation should be used, and on what data?
Assume inline assembly that is not aware of the surrounding context but can tell the compiler which registers it clobbers.
Short answer for now, without going into too much detail about why. See specifically the discussion in comments on that linked question.
`lock orb $0, -1(%rsp)`

is probably a good bet to avoid lengthening dependency chains for local variables that get spilled/reloaded. See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for benchmarks. On Windows x64 (no red zone), that space below RSP should be unused except by future `call` or `push` instructions.
Store forwarding from an earlier store to the load side of a `lock`ed operation might be a thing (if that byte of stack space was recently used), so keeping the locked operation narrow is good. But since it's a full barrier, there can't be any store forwarding from its own output to anything else, so unlike a normal narrow store, a narrow (1-byte) `lock orb` doesn't have that downside.
`mfence` is pretty crap compared to a locked operation on a hot line of stack space even on Haswell, and probably worse on Skylake, where it even blocks out-of-order execution. (It's also bad on AMD compared to `lock add`.)
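For comparison, here is a portable sketch of the "dummy interlocked operation" idea without inline asm (the function name is mine; mainstream compilers don't currently optimize away atomic RMWs, but that is a quality-of-implementation detail, not a guarantee):

```cpp
#include <atomic>

// Portable dummy-interlocked full fence: an atomic exchange on a stack
// local. On x86 this typically compiles to xchg (which has an implicit
// lock prefix), a full barrier. The dummy object lives on this thread's
// own stack, so there's no cross-thread cache-line contention on it.
static inline void full_fence_portable() noexcept {
    std::atomic<int> dummy{0};
    dummy.exchange(0, std::memory_order_seq_cst);
}
```

This trades a few bytes of guaranteed-private stack for not having to write target-specific asm, at the cost of trusting the compiler to keep the dummy RMW.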