I was reading some older MS documentation, *The C# Memory Model in Theory and Practice, Part 2*, and was interested to read:

> One possible fix is to insert a memory barrier into both ThreadA and ThreadB... The CLR JIT will insert a “lock or” instruction in place of the memory barrier. A locked x86 instruction has the side effect of flushing the store buffer
and indeed, looking at the output of the dotnet tools on godbolt.org, I see that `System.Threading.Thread.MemoryBarrier()` gets compiled (presumably jitted) down to the following:

```
lock or  dword ptr [rsp], 0
ret
```

(sharplab.io gives equivalent output for a release build.)
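For concreteness, a minimal snippet along these lines (my own reduction, not code from the article) is enough to reproduce that codegen in a Release build on sharplab.io; it follows the store-then-barrier-then-load shape that the article's fix describes:

```csharp
using System.Threading;

public static class Repro
{
    static int _a, _b;

    // ThreadA-style body: store, full barrier, then load.
    public static int ThreadA()
    {
        _a = 1;
        Thread.MemoryBarrier();  // JITs to `lock or dword ptr [rsp], 0` on x64
        return _b;
    }
}
```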
This seems mildly surprising... Intel have provided the `mfence` instruction, which would appear to be ideal for this purpose, and indeed the older dotnet framework method `Interlocked.SpeculationBarrier` is documented to generate an `mfence` under x86 and amd64 (as did the older `Thread.VolatileRead` and `Thread.VolatileWrite` methods that have since been deprecated). I don't have suitable tools to see the generated `MemoryBarrier()` assembly for other architectures, but the memory model docs suggest that ARM64 gets a `dmb` instruction, which is a full memory barrier and hence presumably equivalent to `mfence`.
There's an interesting answer by BeeOnRope to *Does lock xchg have the same behavior as mfence?* which suggests that `mfence` offers stronger guarantees than `lock` under some circumstances. I can't offer an opinion on that, but even if the two instructions were precisely equivalent, all else being equal I'd have chosen `mfence` as being more obvious in intent. Presumably the compiler engineers at Microsoft know better.
The question, then: why `lock or` instead of `mfence`?
`lock or` is surprisingly faster than `mfence`, and strong enough. (As long as the cache line it's RMWing is already hot in cache and exclusively owned, which is normally the case for the stack.)
`lock add`, dword or byte operand-size, also with an immediate zero, is another common choice. `lock and` with `-1` would also be possible; any memory-destination RMW that leaves the destination and registers (other than EFLAGS) unchanged is equivalent.
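You can see the same family of idioms from the C# side through `Interlocked`: every `Interlocked` RMW is a full barrier, and one whose operand leaves the destination unchanged is purely a fence. A sketch, assuming .NET 5+ for `Interlocked.Or`/`And`; which exact lock-prefixed instruction the JIT picks is my guess, worth checking on sharplab.io:

```csharp
using System.Threading;

static class DummyRmwFence
{
    // Any locked RMW that leaves its destination unchanged is a full
    // barrier. A local keeps the dummy's cache line hot and exclusively
    // owned by this core, like the JIT's `lock or [rsp], 0` trick.
    public static void FullFence()
    {
        int dummy = 0;
        Interlocked.Or(ref dummy, 0);      // .NET 5+; likely a lock-prefixed `or`
        // Interlocked.Add(ref dummy, 0);  // equivalent barrier semantics
        // Interlocked.And(ref dummy, -1); // likewise
    }
}
```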
Experiments on some CPUs have found that an RMW several bytes below ESP/RSP is faster, at least in a function that's going to `ret` (and pop that return address). https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_experimental_results has some benchmarks from Haswell and contemporary AMD for Java `volatile` stores; scroll up from there for what he's benchmarking. (Java `volatile` load/store is like C++ `std::atomic` with `seq_cst`.)
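If you want numbers from your own machine in C# rather than Java, a BenchmarkDotNet sketch along these lines is one way to see what the full barrier costs relative to a plain release store. (BenchmarkDotNet is my assumption here, not something from the question, and nanosecond-scale fence benchmarks are easy to get wrong, so treat results with care.)

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FenceBench
{
    private int _x;

    [Benchmark(Baseline = true)]
    public void VolatileStore() => Volatile.Write(ref _x, 1);  // plain `mov` on x86 (release-only)

    [Benchmark]
    public void StoreThenBarrier()
    {
        _x = 1;
        Thread.MemoryBarrier();  // the `lock or [rsp], 0` full fence
    }

    public static void Main(string[] args) => BenchmarkRunner.Run<FenceBench>();
}
```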
I found that a bit surprising, since an x86 `lock`ed instruction has to drain the store buffer and modify L1d cache before it completes, and before later memory ops are allowed to even start, so I wouldn't have thought that later loads could get a head start even if they were on a different address. But apparently it's a thing. Still, many compilers don't do that; for example, GCC avoids it because it makes Valgrind complain about touching memory that isn't allocated. (Using an offset of `0` saves a byte of machine-code size.)
Linux since 4.15 uses `lock; addl $0,-4(%rsp)` for `smp_mb()` (for communication between cores), but still `mfence` for `mb()` (for drivers ordering MMIO with other accesses).
`mfence` goes above and beyond to make sure it carries out the letter of the spec even in the case of weakly-ordered NT loads from WC memory (video RAM) being in flight. On my Skylake, for example, it also includes `lfence`-like behaviour of blocking out-of-order exec of non-memory ops, as an implementation detail of how they made it that strong.
As you found, *Does lock xchg have the same behavior as mfence?* goes into some details.
Related:
- *Are loads and stores the only instructions that gets reordered?* - `mfence` on Skylake blocking OoO exec of long `imul reg,reg` dep chains, like `lfence`.
- *Which is a better write barrier on x86: lock+addl or xchgl?* - mostly talking about seq_cst stores using `xchg`, vs. `mov`+`mfence` or `mov`+`lock add` to stack space. (A C# rendering of that tradeoff is sketched after this list.)
- *Why does this `std::atomic_thread_fence` work* - hacked-up MSVC library header code to get the compiler to emit `xchg` on a dummy variable instead of `mfence`, but with the disastrous choice to have a single static dummy variable that all barriers in all threads contend for.
- *Why does a std::atomic store with sequential consistency use XCHG?* - `xchg` vs. `mov`+`mfence`; my answer there has a whole section about how to compile C++'s full memory barrier, `atomic_thread_fence(seq_cst)` (`mfence` vs. a dummy `lock`ed operation). But that's not a duplicate; the actual question was different, and I didn't find anywhere I'd written just this as an answer.
- https://shipilev.net/blog/2014/on-the-fence-with-dependencies/ has a big discussion of lots of details, including a microbenchmark.
- https://bugs.openjdk.org/browse/JDK-8050147 - some benchmarks from some CPU about `-8(%rsp)` vs. `(%rsp)`, from Aleksey Shipilev; probably the same ones he has graphs for in his 2014 blog article.
- https://lore.kernel.org/all/[email protected]/T/ - Linux kernel discussion from 2016 about which is better, around when it changed to using a `lock add`, then to `lock add` with an offset of `-4`. (Kernel code doesn't use a red zone because HW interrupts use the kernel stack, so memory below RSP is not going to be re-read soon.)
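As promised in the lock+addl-or-xchgl item above, here's that same tradeoff expressed in this question's language: `Interlocked.Exchange` is the `xchg` flavour, while a plain store followed by `Thread.MemoryBarrier()` is the `mov` + dummy-`lock`ed-RMW flavour. The codegen comments are my expectation for x64, worth verifying on sharplab.io:

```csharp
using System.Threading;

static class SeqCstStore
{
    static int _x;

    // Store with full-barrier semantics, two ways:
    public static void ViaExchange() =>
        Interlocked.Exchange(ref _x, 1);   // single `xchg` (implicitly locked)

    public static void ViaBarrier()
    {
        _x = 1;                  // plain `mov` store
        Thread.MemoryBarrier();  // then `lock or dword ptr [rsp], 0`
    }
}
```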