
Making sense of Memory Barriers


I am attempting to understand memory barriers at a level useful for Java lock-free programmers. This level, I feel, is somewhere between learning only about volatiles and learning about the workings of Store/Load buffers from an x86 manual.

I spent some time reading a bunch of blogs/cookbooks and have come up with the summary below. Could someone more knowledgeable look it over and tell me whether I have missed or mis-stated anything?

LFENCE:

Name             : LFENCE/Load Barrier/Acquire Fence
Barriers         : LoadLoad + LoadStore
Details          : Given sequence {Load1, LFENCE, Load2, Store1}, the
                   barrier ensures that Load1 can't be moved south and
                   Load2 and Store1 can't be moved north of the
                   barrier. 
                   Note that Load2 and Store1 can still be reordered.

Buffer Effect    : Causes the contents of the LoadBuffer
                   (pending loads) to be processed for that CPU. This
                   makes program state exposed by other CPUs visible
                   to this CPU before Load2 and Store1 are executed.

Cost on x86      : Either very cheap or a no-op.
Java instructions: Reading a volatile variable, Unsafe.loadFence()
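To make the "acquire" row above concrete, here is a minimal (hypothetical) Java sketch of the classic message-passing pattern: the volatile read of `ready` is the acquire side, so a reader that observes `ready == true` is guaranteed by the JMM to also observe `payload == 42` (the volatile read can't be reordered with the plain load that follows it):

```java
class AcquireDemo {
    static int payload;             // plain field
    static volatile boolean ready;  // volatile flag

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            payload = 42;   // plain store
            ready = true;   // volatile store (the release side)
        });
        writer.start();
        while (!ready) { Thread.onSpinWait(); } // volatile read (acquire)
        System.out.println(payload);            // guaranteed to see 42
        writer.join();
    }
}
```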

SFENCE

Name             : SFENCE/Store Barrier/Release Fence
Barriers         : StoreStore + LoadStore
Details          : Given sequence {Load1, Store1, SFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 can't be moved north of the
                   barrier.
                   Note that Load1 and Store1 can still be reordered AND
                   Load2 can be moved north of the barrier.
Buffer Effect    : Causes the contents of the StoreBuffer to be
                   flushed to cache for the CPU on which it is issued.
                   This will make program state visible to other CPUs
                   before Store2 and Load2 are executed.
Cost on x86      : Either very cheap or a no-op.
Java instructions: lazySet(), Unsafe.storeFence(), Unsafe.putOrdered*()
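As a sketch of the release-store row, here is a hypothetical single-writer counter using `lazySet()`: the plain store to `value` can't be reordered after the `lazySet()` (StoreStore + LoadStore), but unlike a volatile write, `lazySet()` doesn't pay for a StoreLoad barrier:

```java
import java.util.concurrent.atomic.AtomicLong;

class ReleaseCounter {
    private final AtomicLong sequence = new AtomicLong(0);
    private long value;  // plain field, published via the counter

    // Assumes a single writer thread, so the read-increment is not racy.
    void publish(long v) {
        value = v;                             // plain store
        sequence.lazySet(sequence.get() + 1);  // release-store: value is
                                               // visible before the new seq
    }

    long readSequence() { return sequence.get(); }
    long readValue()    { return value; }
}
```

Readers that see a new sequence number are guaranteed to see the `value` published before it, which is the whole point of a release-store.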

MFENCE

Name             : MFENCE/Full Barrier/Fence
Barriers         : StoreLoad
Details          : Combines the effects of the other three barriers.
                   Given sequence {Load1, Store1, MFENCE, Store2, Load2},
                   the barrier ensures that Load1 and Store1 can't be
                   moved south and Store2 and Load2 can't be moved north
                   of the barrier.
                   Note that Load1 and Store1 can still be reordered AND
                   Store2 and Load2 can still be reordered.
Buffer Effect    : Causes the contents of the LoadBuffer (pending loads)
                   to be processed for that CPU.
                   AND
                   Causes the contents of the StoreBuffer to be flushed
                   to cache for the CPU on which it is issued.
Cost on x86      : The most expensive kind.
Java instructions: Writing to a volatile, Unsafe.fullFence(), Locks
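A hypothetical Dekker-style sketch shows why the StoreLoad barrier (the thing only MFENCE gives you) matters. Because `x` and `y` are volatile, each thread's read cannot be hoisted above its preceding write, so at least one thread must observe the other's write; with plain fields, `r1 == 0 && r2 == 0` would be a legal outcome:

```java
class StoreLoadDemo {
    static volatile int x, y;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // JMM guarantee: never r1 == 0 && r2 == 0
        System.out.println(r1 + " " + r2);
    }
}
```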

Finally, if both SFENCE and MFENCE drain the StoreBuffer (invalidating cache lines and waiting for acks from other CPUs), why is one a no-op and the other a very expensive op?

Thanks

(Cross-posted from Google's Mechanical Sympathy forum)


Solution

  • You're using Java, so all that really matters is the Java memory model. Compile-time (including JIT) optimizations will re-order your memory accesses within the limitations of the Java memory model, not the stronger x86 memory model that the JVM happens to be JIT-compiling for. (See my answer to How does memory reordering help processors and compilers?)

    Still, learning about x86 can give your understanding a concrete foundation, but don't fall into the trap of thinking that Java on x86 works like assembly on x86. (Or that the whole world is x86. Many other architectures are weakly ordered, like the Java memory model.)


x86 LFENCE and SFENCE are no-ops as far as memory ordering is concerned, unless you use movnt weakly-ordered cache-bypassing stores. Normal loads are implicitly acquire-loads, and normal stores are implicitly release-stores.


    You have an error in your table: SFENCE is "not ordered with respect to load instructions", according to Intel's instruction set reference manual. It is only a StoreStore barrier, not a LoadStore barrier.

    (That link is an html conversion of Intel's pdfs. See the tag wiki for links to the official version.)

    lfence is a LoadLoad and LoadStore barrier, so your table is correct.

    But CPUs don't really "buffer" loads ahead of time. They do them and start using the results for out-of-order execution as soon as the results are available. (Usually, instructions that use the result of a load have already been decoded and issued before the load's result is ready, even on an L1 cache hit.) This is the fundamental difference between loads and stores.


    SFENCE is cheap because it doesn't actually have to drain the store buffer; draining is just one way to implement it that keeps the hardware simple, at the cost of performance.

    MFENCE is expensive because it's the only barrier that prevents StoreLoad reordering. See Jeff Preshing's Memory Reordering Caught in the Act for an explanation, and a test program that actually demonstrates StoreLoad reordering on real hardware.

    Jeff Preshing's blog posts are gold for understanding lock-free programming and memory ordering semantics. I usually link his blog in my SO answers to memory-ordering questions. You can probably use search on that to find those answers, if you're interested in reading more of what I've written (mostly C++ / asm, not Java though).


    Fun fact: Any atomic read-modify-write operation on x86 is also a full memory barrier. The lock prefix, which is implicit on xchg [mem], reg, is also a full barrier. lock add [esp], 0 was a common idiom for a memory barrier that's otherwise a no-op, before mfence was available. (stack memory is almost always hot in L1, and not shared).
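In Java terms, this is why an atomic increment doubles as a full barrier on x86: the JIT typically compiles `getAndIncrement()` to `lock xadd`, and the `lock` prefix is itself a full barrier. A minimal sketch (the field names are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

class RmwBarrier {
    static final AtomicInteger counter = new AtomicInteger();
    static int data;

    static int writer() {
        data = 1;                           // plain store
        return counter.getAndIncrement();   // atomic RMW: on x86, lock xadd,
                                            // which is also a full barrier
    }
}
```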

    So on x86, incrementing an atomic counter has the same performance regardless of the memory-ordering semantics you request. (e.g. c++11 memory_order_relaxed vs. memory_order_seq_cst (sequential consistency)). Use whatever memory-order semantics are appropriate, though, because other architectures can do atomic ops without full memory barriers. Forcing the compiler / JVM to use a memory barrier when you don't need it is a waste.
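Java 9's `VarHandle` exposes the analogous choice of ordering strengths. In this hypothetical sketch, both increments JIT to the same `lock xadd` on x86, but on weakly-ordered hardware (e.g. ARM) the release version can be cheaper:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class CounterModes {
    static int count;
    static final VarHandle COUNT;
    static {
        try {
            COUNT = MethodHandles.lookup()
                    .findStaticVarHandle(CounterModes.class, "count", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Volatile-strength RMW: full ordering, like C++ memory_order_seq_cst.
    static int incVolatile() { return (int) COUNT.getAndAdd(1); }

    // Release-strength RMW: weaker ordering, can be cheaper on ARM etc.
    static int incRelease()  { return (int) COUNT.getAndAddRelease(1); }
}
```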