c++11 gcc x86 memory-barriers memory-model

x86 mfence and C++ memory barrier

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The below code is the one I'm testing using gcc_x86_64_8.3.

std::atomic<bool> flag {false};
int any_value {0};

void set()
{
  any_value = 10;
  flag.store(true, std::memory_order_release);
}

void get()
{
  while (!flag.load(std::memory_order_acquire));
  assert(any_value == 10);
}

int main()
{
  std::thread a {set};
  get();
  a.join();
}

When I use std::memory_order_seq_cst, I can see the MFENCE instruction is used with any optimization -O1, -O2, -O3. This instruction makes sure the store buffers are flushed, therefore updating their data in L1D cache (and using MESI protocol to make sure other threads can see effect).

However when I use std::memory_order_release/acquire with no optimizations MFENCE instruction is also used, but the instruction is omitted using -O1, -O2, -O3 optimizations, and not seeing other instructions that flush the buffers.

In the case where MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?

Below is the assembly code for the get/set functions with -O3, like what we get on the Godbolt compiler explorer:

set():
        mov     DWORD PTR any_value[rip], 10
        mov     BYTE PTR flag[rip], 1
        ret


.LC0:
        .string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
        .string "any_value == 10"

get():
.L8:
        movzx   eax, BYTE PTR flag[rip]
        test    al, al
        je      .L8
        cmp     DWORD PTR any_value[rip], 10
        jne     .L15
        ret
.L15:
        push    rax
        mov     ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
        mov     edx, 17
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        call    __assert_fail

Solution

The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions¹, which is all what the release semantics require. Also the processor will commit a store instruction as soon as possible; when the store instruction retires, the store becomes the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation². So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible and when it does, any_value is guaranteed to be 10.

On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both³ barriers and so it is used at all optimization levels.

Footnotes:

(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier. Anyway, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.

(2) This is in contrast to write-combining stores which are made globally-visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.

(3) See the discussion under Peter's answer.