c# multithreading volatile cpu-cache

Does MemoryBarrier really ensure refresh values?


In his marvelous book C# in a Nutshell (a free chapter is available online), Albahari explains how a memory barrier lets a thread read a "refreshed" value. His example is:

    static void Main()
    {
        bool complete = false;
        var t = new Thread(() =>
        {
            bool toggle = false;
            while (!complete)
            {
                toggle = !toggle;
            }
        });
        t.Start();
        Thread.Sleep(1000);
        complete = true;
        t.Join();        // Blocks indefinitely
    }

As he says, this blocks indefinitely when built in release mode. He offers a few ways to fix it: call Thread.MemoryBarrier inside the while loop, use a lock, or make "complete" a volatile static field.
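For reference, two of those fixes might look roughly like the sketch below. The class and method names (`SpinFixes`, `RunWithVolatileField`, `RunWithMemoryBarrier`) are made up for illustration and are not from the book; the sleep is shortened and `Join` is given a timeout so that each method can report whether the worker actually stopped.

```csharp
using System;
using System.Threading;

static class SpinFixes
{
    // Fix A: a volatile static field. The JIT may not cache reads of it in a register.
    static volatile bool completeVolatile;

    public static bool RunWithVolatileField()
    {
        completeVolatile = false;
        var t = new Thread(() =>
        {
            bool toggle = false;
            while (!completeVolatile)
            {
                toggle = !toggle;
            }
        });
        t.IsBackground = true;   // do not keep the process alive if the worker spins forever
        t.Start();
        Thread.Sleep(100);
        completeVolatile = true;
        return t.Join(TimeSpan.FromSeconds(5));   // true if the worker exited in time
    }

    // Fix B: a full fence inside the loop. The read of "complete" cannot be
    // moved across the fence, so it has to be repeated on every iteration.
    public static bool RunWithMemoryBarrier()
    {
        bool complete = false;
        var t = new Thread(() =>
        {
            bool toggle = false;
            while (!complete)
            {
                Thread.MemoryBarrier();
                toggle = !toggle;
            }
        });
        t.IsBackground = true;
        t.Start();
        Thread.Sleep(100);
        complete = true;
        return t.Join(TimeSpan.FromSeconds(5));
    }
}
```

Both methods are expected to return true in debug and release builds alike, whereas the original loop may never terminate in a release build.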

I would agree with the volatile field solution, as volatile enforces a direct memory read rather than a register read by the JIT. However, I believe this optimization has nothing to do with fences and memory barriers. It is just a matter of whether the JIT prefers to read the variable from memory or from a register. In fact, instead of MemoryBarrier, any method call "convinces" the JIT not to use the register at all, as in:

    using System.Runtime.CompilerServices;
    using System.Threading;

    class Program
    {
        [MethodImpl( MethodImplOptions.NoInlining)]
        public static bool Toggle(bool toggle)
        {
            return !toggle;
        }
        static void Main()
        {
            bool complete = false;
            var t = new Thread(() =>
            {
                bool toggle = false;
                while (!complete)
                {
                    toggle = Toggle(toggle);
                }
            });
            t.Start();
            Thread.Sleep(1000);
            complete = true;
            t.Join();        // Returns here: the loop re-reads complete (observed, not guaranteed)
        }
    }

Here I am making a dummy Toggle call. From the generated assembly I can clearly see that the JIT reads the "complete" local variable directly from memory. Hence my assumption: at least on Intel CPUs, and given the compiler optimizations involved, MemoryBarrier plays no role in terms of "freshness". MemoryBarrier just acts as a full fence to preserve ordering, and that's it. Am I correct to think that way?


Solution

  • I would agree with the volatile field solution, as volatile enforces a direct memory read rather than a register read by the JIT. However, I believe this optimization has nothing to do with fences and memory barriers.

    Volatile reads and writes are described in ECMA-335, I.12.6.7. Important parts of this section:

    A volatile read has “acquire semantics” meaning that the read is guaranteed to occur prior to any references to memory that occur after the read instruction in the CIL instruction sequence. A volatile write has “release semantics” meaning that the write is guaranteed to happen after any memory references prior to the write instruction in the CIL instruction sequence.

    A conforming implementation of the CLI shall guarantee this semantics of volatile operations.

    and

    An optimizing compiler that converts CIL to native code shall not remove any volatile operation, nor shall it coalesce multiple volatile operations into a single operation.

    Acquire and release semantics on x86 and x86-64 don't require any memory barriers, because the hardware memory model is already at least as strong as volatile semantics require. On ARM, however, the JIT must emit half fences (one-directional memory barriers).
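To make the acquire/release wording concrete, here is a minimal publication sketch. It assumes .NET 4.5 or later for `System.Threading.Volatile`; the names `Publication`, `payload`, `ready` and `PublishAndConsume` are hypothetical and chosen only for this illustration.

```csharp
using System.Threading;

static class Publication
{
    static int payload;   // ordinary field, written before the flag is published
    static bool ready;    // only accessed through Volatile.Read / Volatile.Write

    public static int PublishAndConsume()
    {
        payload = 0;
        Volatile.Write(ref ready, false);
        int observed = -1;

        var consumer = new Thread(() =>
        {
            // Acquire: memory accesses after this read cannot move before it.
            while (!Volatile.Read(ref ready))
            {
                Thread.Yield();
            }
            observed = payload;
        });
        consumer.Start();

        payload = 42;
        // Release: the write to payload above cannot move after this write.
        Volatile.Write(ref ready, true);

        consumer.Join();   // after Join, the consumer's write to observed is visible
        return observed;
    }
}
```

Once the consumer sees `ready == true`, it must also see `payload == 42`. On x86 the two volatile operations compile to plain loads and stores, while on ARM the JIT has to add the half fences mentioned above.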

    So the volatile version of that example works because of the optimization restriction: the JIT may not remove or coalesce the volatile reads. The MemoryBarrier version works because the compiler can't collapse the reads of that variable into a single read outside the loop: the read can't be moved across the MemoryBarrier.

    But the code

    while (!complete)
    {
        toggle = Toggle(toggle);
    }
    

    is allowed to be optimized into something like this:

    var tmp = complete;
    while (!tmp)
    {
        toggle = Toggle(toggle);
    }
    

    With the method call, the optimization doesn't happen only because, for some reason, the JIT did not apply it (but it is allowed to). So this code is fragile and implementation-specific: it relies not on the standard but on implementation details, which might change.
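A local variable cannot be declared volatile, so one way to keep the captured local and still get a guarantee is to read it through `Volatile.Read`. The sketch below assumes .NET 4.5 or later; the wrapper names `RobustLoop` and `Run` are hypothetical, and the timeout on `Join` is only there so the method can report the outcome.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

static class RobustLoop
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool Toggle(bool toggle)
    {
        return !toggle;
    }

    public static bool Run()
    {
        bool complete = false;   // captured local; lives in a compiler-generated closure object
        var t = new Thread(() =>
        {
            bool toggle = false;
            // A volatile read may not be removed or coalesced, so it cannot be
            // hoisted out of the loop, whatever the JIT does with Toggle.
            while (!Volatile.Read(ref complete))
            {
                toggle = Toggle(toggle);
            }
        });
        t.IsBackground = true;   // do not keep the process alive if the worker spins forever
        t.Start();
        Thread.Sleep(100);
        Volatile.Write(ref complete, true);
        return t.Join(TimeSpan.FromSeconds(5));   // true if the worker exited in time
    }
}
```

Unlike the plain method-call version, termination here rests on the volatile semantics quoted above rather than on what a particular JIT happens to do.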