Tags: c++, gcc, processor, memory-barriers

What is the difference between a memory barrier and a compiler-only fence?


As the title states, I'm confused about the difference between a memory barrier and a compiler-only fence.

Are they the same? If not what is the difference between them?


Solution

  • As a concrete example, consider the following code:

    int x = 0, y = 0;
    
    void foo() {
        x = 10;
        y = 20;
    }
    

    As it stands, without any barriers or fences, the compiler may reorder the two stores and emit assembly (pseudo)code like

    STORE [y], 20
    STORE [x], 10
    

    If you insert a compiler-only fence between x = 10; and y = 20;, the compiler is inhibited from doing this, and must instead emit

    STORE [x], 10
    STORE [y], 20
    

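In standard C++, one way to express such a compiler-only fence is `std::atomic_signal_fence` (GCC's `asm volatile("" ::: "memory")` is a common non-portable equivalent). As a sketch of the example above:

```cpp
#include <atomic>

int x = 0, y = 0;

void foo() {
    x = 10;
    // Compiler-only fence: the compiler may not reorder memory accesses
    // across this point, but no CPU barrier instruction is emitted.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    y = 20;
}
```

This constrains only the generated instruction order; the hardware is still free to make the stores visible to other cores in either order.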
    However, suppose we have another observer looking at the values of x and y in memory, such as a memory-mapped hardware device, or another thread that is going to do

    void observe() {
        std::cout << x << ", ";
        std::cout << y << std::endl;
    }
    

    (Assume for simplicity that the loads from x and y in observe() do not get reordered in any way, and that loads and stores to int happen to be atomic on this system.) Depending on when its loads take place with respect to the stores in foo(), we can see that it could print out 0, 0 or 10, 0 or 10, 20. It might appear that 0, 20 would be impossible, but that is actually not so in general.

    Even though the instructions in foo stored x and y in that order, on some architectures without strict store ordering, that does not guarantee that those stores will become visible to observe() in the same order. It could be that due to out-of-order execution, the core executing foo() actually executed the store to y before the store to x. (Say, if the cache line containing y was already in L1 cache, but the cache line for x was not; the CPU might as well go ahead and do the store to y rather than stalling for possibly hundreds of cycles while the cache line for x is loaded.) Or, the stores could be held in a store buffer and possibly flushed out to L1 cache in the opposite order. Either way, it is possible that observe() prints out 0, 20.

    To ensure the desired ordering, the CPU has to be told to do so, often by executing an explicit memory barrier instruction between the two stores. This will cause the CPU to wait until the store to x has been made visible (by loading the cache line, draining the store buffer, etc) before making the store to y visible. So if you ask the compiler to put in a memory barrier, it will emit assembly like

    STORE [x], 10
    BARRIER
    STORE [y], 20
    

    In this case, you can be assured that observe() will print either 0, 0 or 10, 0 or 10, 20, but never 0, 20.
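In C++, a full hardware barrier at that point can be requested with `std::atomic_thread_fence`. A minimal sketch, keeping the plain `int` variables of the example above (a real program would need `std::atomic`, as noted below):

```cpp
#include <atomic>

int x = 0, y = 0;

void foo() {
    x = 10;
    // Full memory barrier: the compiler may not reorder across it, and it
    // emits a CPU barrier instruction (e.g. mfence on x86, dmb on ARM) so
    // the store to x becomes visible before the store to y.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    y = 20;
}
```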

    (Please note that many simplifying assumptions have been made here. If trying to write this in actual C++, you'd need to use std::atomic types and some similar barrier in observe() to ensure its loads were not reordered.)
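For completeness, here is one way the example might look as well-defined C++, a sketch using `std::atomic` with release/acquire ordering so that both the stores in `foo()` and the loads in `observe()` are constrained:

```cpp
#include <atomic>
#include <iostream>

std::atomic<int> x{0}, y{0};

void foo() {
    x.store(10, std::memory_order_relaxed);
    // Release store: all earlier writes must be visible to any thread
    // that observes this store with an acquire load.
    y.store(20, std::memory_order_release);
}

void observe() {
    // Acquire load pairs with the release store above: if we read
    // y == 20, we are guaranteed to also read x == 10.
    int b = y.load(std::memory_order_acquire);
    int a = x.load(std::memory_order_relaxed);
    std::cout << a << ", " << b << std::endl;
}
```

With this pairing, `observe()` can print `0, 0`, `10, 0`, or `10, 20`, but never `0, 20`, and the program is free of data races.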