Tags: c#, .net-core, assembly, x86-64, memory-barriers

Why does C#'s Thread.MemoryBarrier() use "lock or" instead of mfence?


I was reading some older MS documentation, The C# Memory Model in Theory and Practice, Part 2 and was interested to read:

One possible fix is to insert a memory barrier into both ThreadA and ThreadB... The CLR JIT will insert a “lock or” instruction in place of the memory barrier. A locked x86 instruction has the side effect of flushing the store buffer

and indeed, looking at the output of the dotnet tools on godbolt.org, I see that System.Threading.Thread.MemoryBarrier() gets compiled (presumably jitted) down to the following:

       lock     
       or       dword ptr [rsp], 0
       ret  

(sharplab.io gives equivalent output for a release build.)

This seems mildly surprising... Intel has provided the mfence instruction, which would appear to be ideal for this purpose, and indeed the older .NET Framework method Interlocked.SpeculationBarrier is documented to generate an mfence under x86 and amd64 (as did the older Thread.VolatileRead and Thread.VolatileWrite methods that have since been deprecated). I don't have suitable tools to see the generated MemoryBarrier() assembly for other architectures, but the memory model docs suggest that ARM64 gets a dmb instruction, which is a full memory barrier and hence presumably equivalent to mfence.

There's an interesting answer by BeeOnRope to Does lock xchg have the same behavior as mfence? suggesting that mfence offers stronger guarantees than lock under some circumstances. I can't offer an opinion on that, but even if the two instructions were precisely equivalent, all else being equal I'd have chosen mfence as the more obvious in intent. Presumably the compiler engineers at Microsoft know better.

The question then: why lock or instead of mfence?


Solution

  • lock or is surprisingly faster than mfence, and strong enough. (As long as the cache line it's RMWing is already hot in cache and exclusively owned, which is normally the case for the stack.)

    lock add (with a dword or byte operand), also with an immediate zero, is another common choice. lock and with -1 would also be possible; any memory-destination RMW that leaves both the destination and the registers (other than EFLAGS) unchanged is equivalent.

    Experiments on some CPUs have found that RMW several bytes below ESP / RSP is faster, at least in a function that's going to ret (and pop that return address). https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_experimental_results has some benchmarks from Haswell and contemporary AMD for Java volatile stores; scroll up from there for what he's benchmarking. (Java volatile load/store is like C++ std::atomic with seq_cst.)

    I found that a bit surprising, since an x86 locked instruction has to drain the store buffer and modify L1d cache before it completes and before later memory ops are allowed to even start, so I wouldn't have thought that later loads could get a head start even if they were to a different address. But apparently it's a thing. Still, many compilers don't use an offset below RSP; for example, GCC avoids it because it makes Valgrind complain about touching memory that isn't allocated. (Using an offset of 0 saves a byte of machine-code size.)
    Linux since 4.15 uses lock; addl $0,-4(%rsp) for smp_mb() (for communication between cores) but still mfence for mb() (for drivers ordering MMIO with other accesses).


    MFENCE being stronger is why it's slower

    mfence goes above and beyond to make sure it carries out the letter of the spec, even in the case of weakly-ordered NT loads from WC memory (e.g. video RAM) being in flight. On my Skylake, for example, it also includes lfence-like behaviour of blocking out-of-order exec of non-memory ops, as an implementation detail of how they made it that strong.

    As you found, Does lock xchg have the same behavior as mfence? goes into some details.
