
Why does Unsafe.fullFence() not ensure visibility in my example?


I am trying to dive deep into the volatile keyword in Java and have set up 2 testing environments. I believe both of them are x86_64 and use HotSpot.

Java version: 1.8.0_232
CPU: AMD Ryzen 7 8Core

Java version: 1.8.0_231
CPU: Intel I7

Code is here:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class Test {

  private boolean flag = true; //left non-volatile intentionally
  private volatile int dummyVolatile = 1;

  public static void main(String[] args) throws Exception {
    Test t = new Test();
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);

    Thread t1 = new Thread(() -> {
        while (t.flag) {
          //int b = t.someValue;
          //unsafe.loadFence();
          //unsafe.storeFence();
          //unsafe.fullFence();
        }
        System.out.println("Finished!");
      });

    Thread t2 = new Thread(() -> {
        t.flag = false;
        unsafe.fullFence();
      });

    t1.start();
    Thread.sleep(1000);
    t2.start();
    t1.join();
  }
}

"Finished!" is never printed, which does not make sense to me. I expected the fullFence in thread 2 to make flag = false globally visible.

From my research, HotSpot uses lock/mfence to implement fullFence on x86. According to Intel's instruction-set reference manual entry for mfence:

This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.

Even "worse": if I comment out fullFence in thread 2 and un-comment any one of the xxxFence calls in thread 1, the code prints "Finished!". This makes even less sense, because at least lfence is "useless"/a no-op on x86.

Maybe my source of information contains inaccuracies, or I am misunderstanding something. Please help, thanks!


Solution

  • It's not the runtime effect of the fence that matters; it's the compile-time effect of forcing the compiler to reload values from memory.

    Your t1 loop contains no volatile reads or anything else that could synchronize-with another thread, so there's no guarantee it will ever notice any changes to any variables. i.e. when JITing into asm, the compiler can make a loop that loads the value into a register once, instead of reloading it from memory every time. This is the kind of optimization you always want the compiler to be able to do for non-shared data, which is why the language has rules that let it do this when there's no possible synchronization.

    And then of course the condition can get hoisted out of the loop. So with no barriers or anything, your reader loop can JIT into asm that implements this logic:

    if(t.flag) {
       for(;;){}  // infinite loop
    }
    

    Besides ordering, the other part of Java volatile is the assumption that other threads may change it asynchronously, so multiple reads can't be assumed to give the same value.
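    A minimal variation of the original program (the class name VolatileTest is made up here) sketches how declaring the field volatile changes things: the JIT may no longer hoist the load out of the loop, so the reader observes the write and terminates.

```java
public class VolatileTest {

  // volatile: the JIT must assume other threads can change this at any time,
  // so it cannot hoist the load out of the while loop below
  private volatile boolean flag = true;

  public static void main(String[] args) throws Exception {
    VolatileTest t = new VolatileTest();

    Thread reader = new Thread(() -> {
        while (t.flag) {
          // no fences needed: the volatile read is reloaded every iteration
        }
        System.out.println("Finished!");
      });

    reader.start();
    Thread.sleep(1000);
    t.flag = false; // volatile write: guaranteed visible to the reader
    reader.join();  // terminates, unlike the non-volatile version
  }
}
```

    With volatile on the field, neither thread needs any explicit fence calls.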

    But unsafe.loadFence(); makes the JVM reload t.flag from (cache-coherent) memory every iteration. I don't know if this is required by the Java spec or merely an implementation detail that makes it happen to work.

    If this was C++ with a non-atomic variable (which would be undefined behaviour in C++), you'd see exactly the same effect in a compiler like GCC. _mm_lfence would also be a compile-time full-barrier as well as emitting a useless lfence instruction, effectively telling the compiler that all memory might have changed and thus needs to be reloaded. So it can't reorder loads across it, or hoist them out of loops.

    BTW, I wouldn't be so sure that unsafe.loadFence() even JITs to an lfence instruction on x86. It is useless for memory ordering (except for very obscure stuff like fencing NT loads from WC memory, e.g. copying from video RAM, which the JVM can assume isn't happening), so a JVM JITing for x86 could just treat it as a compile-time barrier. Just like what C++ compilers do for std::atomic_thread_fence(std::memory_order_acquire); - block compile-time reordering of loads across the barrier, but emit no asm instructions, because the asm memory model of the host running the JVM is already strong enough.


    In thread 2, unsafe.fullFence(); is, I think, useless. It just makes that thread wait until earlier stores become globally visible, before any later loads/stores can happen. t.flag = false; is a visible side effect that can't be optimized away, so it definitely happens in the JITed asm whether there's a barrier following it or not, even though it's not volatile. And it can't be delayed or merged with something else because there's nothing else in the same thread.

    Asm stores always become visible to other threads, the only question is whether the current thread waits for its store buffer to drain or not before doing more stuff (especially loads) in this thread. i.e. prevent all reordering, including StoreLoad. Java volatile does that, like C++ memory_order_seq_cst (by using a full barrier after every store), but without a barrier it's still a store like C++ memory_order_relaxed. (Or when JITing x86 asm, loads/stores are actually as strong as acquire/release.)
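    As a sketch of that acquire/release strength (the class name Publish is hypothetical): a plain store ordered before a volatile store is guaranteed visible to a thread whose volatile load sees the flag set - the usual safe-publication idiom.

```java
public class Publish {
  private int data = 0;                  // plain, non-volatile field
  private volatile boolean ready = false;

  void writer() {
    data = 42;       // plain store...
    ready = true;    // ...ordered before this volatile store (release)
  }

  int reader() {
    while (!ready) { }   // volatile load (acquire): spin until published
    return data;         // happens-before guarantees this sees 42
  }

  public static void main(String[] args) throws Exception {
    Publish p = new Publish();
    Thread r = new Thread(() -> System.out.println(p.reader()));
    r.start();
    new Thread(p::writer).start();
    r.join();   // prints 42
  }
}
```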

    Caches are coherent, and the store buffer always drains itself (committing to L1d cache) as fast as it can to make room for more stores to execute.


    Caveat: I don't know a lot of Java, and I don't know exactly how unsafe / undefined it is to assign a non-volatile in one thread and read it in another with no synchronization. Based on the behaviour you're seeing, it sounds exactly like what you'd see in C++ for the same thing with non-atomic variables (with optimization enabled, like HotSpot always does).

    (Based on @Margaret's comment, I updated with some guesswork about how I assume Java synchronization works. If I mis-stated anything, please edit or comment.)

    In C++, data races on non-atomic vars are always Undefined Behaviour, but of course when compiling for real ISAs (which don't do hardware race-prevention) the results are sometimes what people wanted.


    PS: just using barriers to force a compiler to re-read a value isn't safe in general: it could choose to re-read the value multiple times even if the source copies it to a local variable. So the same tmp var might seem to be both true and false in one execution. At least that's true in C and C++ because data races are undefined behaviour in those languages; see Who's afraid of a big bad optimizing compiler? on LWN about this and other problems you'd run into if you just use barriers and plain (non-volatile) variables. Again, I don't know if that's a possible problem in Java or if the language spec would forbid a JVM from inventing loads after int tmp = shared_plain_int; if tmp is used multiple times across function calls.
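    A Java-flavoured sketch of that hazard (the class name InventedLoad and its helper methods are made up; whether a JVM may actually perform this transformation is exactly the open question above):

```java
public class InventedLoad {
  static boolean sharedFlag = true;   // plain field, written by another thread
  static int count = 0;

  static void doA() { count++; }
  static void doB() { count++; }

  // Source as written: one load, copied to a local.
  static void asWritten() {
    boolean tmp = sharedFlag;
    if (tmp) doA();
    if (tmp) doB();     // same tmp: doA and doB run together or not at all
  }

  // What an optimizer could emit if it "un-copies" the local and re-reads
  // the shared field (legal under a C/C++ data race; unclear for the JVM):
  static void afterInventedLoads() {
    if (sharedFlag) doA();  // first load might see true...
    if (sharedFlag) doB();  // ...second load might see false, so tmp appears
                            // to be both true and false in one execution
  }
}
```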