I was reading some older MS documentation, *The C# Memory Model in Theory and Practice, Part 2*, and was interested to read:

> One possible fix is to insert a memory barrier into both ThreadA and ThreadB... The CLR JIT will insert a “lock or” instruction in place of the memory barrier. A locked x86 instruction has the side effect of flushing the store buffer
and indeed, looking at the output of the dotnet tools on godbolt.org, I see that `System.Threading.Thread.MemoryBarrier()` gets compiled (presumably jitted) down to the following:

```
lock or  dword ptr [rsp], 0
ret
```

(sharplab.io gives equivalent output for a release build.)
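For concreteness, a minimal snippet along these lines (my own reduction, not code from the article) is enough to reproduce that codegen in a Release build on sharplab.io; it follows the store-then-barrier-then-load shape that the article's fix describes:

```csharp
using System.Threading;

public static class Repro
{
    static int _a, _b;

    // ThreadA-style body: store, full barrier, then load.
    public static int ThreadA()
    {
        _a = 1;
        Thread.MemoryBarrier();  // JITs to `lock or dword ptr [rsp], 0` on x64
        return _b;
    }
}
```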
This seems mildly surprising... Intel have provided the `mfence` instruction, which would appear to be ideal for this purpose, and indeed the older dotnet framework method `Interlocked.SpeculationBarrier` is documented to generate an `mfence` under x86 and amd64 (as did the older `Thread.VolatileRead` and `Thread.VolatileWrite` methods that have since been deprecated). I don't have suitable tools to see the generated `MemoryBarrier()` assembly for other architectures, but the memory model docs suggest that ARM64 gets a `dmb` instruction, which is a full memory barrier and hence presumably equivalent to `mfence`.
There's an interesting answer by BeeOnRope to *Does lock xchg have the same behavior as mfence?* which suggests that `mfence` offers stronger guarantees than `lock` under some circumstances. I can't offer an opinion on that, but even if the two instructions were precisely equivalent, all else being equal I'd have chosen `mfence` as being more obvious in intent. Presumably the compiler engineers at Microsoft know better.
The question, then: why `lock or` instead of `mfence`?
`lock or` is surprisingly faster than `mfence`, and strong enough. (As long as the cache line it's RMWing is already hot in cache and exclusively owned, which is normally the case for the stack.)
`lock add`, dword or byte operand-size, also with an immediate zero, is another common choice. `lock and` with `-1` would also be possible; any memory-destination RMW that leaves the destination and registers (other than EFLAGS) unchanged is equivalent.
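You can see the same family of idioms from the C# side through `Interlocked`: every `Interlocked` RMW is a full barrier, and one whose operand leaves the destination unchanged is purely a fence. A sketch, assuming .NET 5+ for `Interlocked.Or`/`And`; which exact lock-prefixed instruction the JIT picks is my guess, worth checking on sharplab.io:

```csharp
using System.Threading;

static class DummyRmwFence
{
    // Any locked RMW that leaves its destination unchanged is a full
    // barrier. A local keeps the dummy's cache line hot and exclusively
    // owned by this core, like the JIT's `lock or [rsp], 0` trick.
    public static void FullFence()
    {
        int dummy = 0;
        Interlocked.Or(ref dummy, 0);      // .NET 5+; likely a lock-prefixed `or`
        // Interlocked.Add(ref dummy, 0);  // equivalent barrier semantics
        // Interlocked.And(ref dummy, -1); // likewise
    }
}
```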
Experiments on some CPUs have found that an RMW several bytes below ESP/RSP is faster, at least in a function that's going to `ret` (and pop that return address). https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_experimental_results has some benchmarks from Haswell and contemporary AMD for Java `volatile` stores; scroll up from there for what he's benchmarking. (Java `volatile` load/store is like C++ `std::atomic` with `seq_cst`.)
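If you want numbers from your own machine in C# rather than Java, a BenchmarkDotNet sketch along these lines is one way to see what the full barrier costs relative to a plain release store. (BenchmarkDotNet is my assumption here, not something from the question, and nanosecond-scale fence benchmarks are easy to get wrong, so treat results with care.)

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FenceBench
{
    private int _x;

    [Benchmark(Baseline = true)]
    public void VolatileStore() => Volatile.Write(ref _x, 1);  // plain `mov` on x86 (release-only)

    [Benchmark]
    public void StoreThenBarrier()
    {
        _x = 1;
        Thread.MemoryBarrier();  // the `lock or [rsp], 0` full fence
    }

    public static void Main(string[] args) => BenchmarkRunner.Run<FenceBench>();
}
```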
I found that a bit surprising, since an x86 `lock`ed instruction has to drain the store buffer and modify L1d cache before it completes, and before later memory ops are allowed to even start, so I wouldn't have thought that later loads could get a head start even if they were on a different address. But apparently it's a thing. Still, many compilers don't do that; for example, GCC avoids it because it makes Valgrind complain about touching memory that isn't allocated. (Using an offset of `0` saves a byte of machine-code size.)
Linux since 4.15 uses `lock; addl $0,-4(%rsp)` for `smp_mb()` (for communication between cores), but still `mfence` for `mb()` (for drivers ordering MMIO with other accesses).
`mfence` goes above and beyond to make sure it carries out the letter of the spec even in the case of weakly-ordered NT loads from WC memory (video RAM) being in flight. On my Skylake, for example, it also includes `lfence`-like behaviour of blocking out-of-order exec of non-memory ops, as an implementation detail of how they made it that strong.
As you found, *Does lock xchg have the same behavior as mfence?* goes into some details.
Related:
- *Are loads and stores the only instructions that gets reordered?* - `mfence` on Skylake blocking OoO exec of long `imul reg,reg` dep chains, like `lfence`.
- *Which is a better write barrier on x86: lock+addl or xchgl?* - mostly talking about seq_cst stores using `xchg`, vs. `mov`+`mfence` or `mov`+`lock add` to stack space. (A C# rendering of that tradeoff is sketched after this list.)
- *Why does this `std::atomic_thread_fence` work* - hacked-up MSVC library header code to get the compiler to emit `xchg` on a dummy variable instead of `mfence`, but with the disastrous choice to have a single static dummy variable that all barriers in all threads contend for.
- *Why does a std::atomic store with sequential consistency use XCHG?* - `xchg` vs. `mov`+`mfence`; my answer there has a whole section about how to compile C++'s full memory barrier, `atomic_thread_fence(seq_cst)` (`mfence` vs. a dummy `lock`ed operation). But that's not a duplicate; the actual question was different, and I didn't find anywhere I'd written just this as an answer.
- https://shipilev.net/blog/2014/on-the-fence-with-dependencies/ has a big discussion of lots of details, including a microbenchmark.
- https://bugs.openjdk.org/browse/JDK-8050147 - some benchmarks from some CPU about `-8(%rsp)` vs. `(%rsp)`, from Aleksey Shipilev; probably the same ones he has graphs for in his 2014 blog article.
- https://lore.kernel.org/all/[email protected]/T/ - Linux kernel discussion from 2016 about which is better, around when it changed to using a `lock add`, then to `lock add` with an offset of `-4`. (Kernel code doesn't use a red zone because HW interrupts use the kernel stack, so memory below RSP is not going to be re-read soon.)
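As promised in the lock+addl-or-xchgl item above, here's that same tradeoff expressed in this question's language: `Interlocked.Exchange` is the `xchg` flavour, while a plain store followed by `Thread.MemoryBarrier()` is the `mov` + dummy-`lock`ed-RMW flavour. The codegen comments are my expectation for x64, worth verifying on sharplab.io:

```csharp
using System.Threading;

static class SeqCstStore
{
    static int _x;

    // Store with full-barrier semantics, two ways:
    public static void ViaExchange() =>
        Interlocked.Exchange(ref _x, 1);   // single `xchg` (implicitly locked)

    public static void ViaBarrier()
    {
        _x = 1;                  // plain `mov` store
        Thread.MemoryBarrier();  // then `lock or dword ptr [rsp], 0`
    }
}
```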