Tags: c++, multithreading, c++11, memory-barriers, stdatomic

Is 11 a valid output under ISO C++ for x86_64, ARM, or other archs?


This question is based on Can't relaxed atomic fetch_add reorder with later loads on x86, like store can? I agree with the answer given there: on x86, 00 will never occur because a.fetch_add has a lock prefix / acts as a full barrier and loads can't reorder above the fetch_add, but on other architectures like ARM/MIPS it can print 00. I have two follow-up questions about the store buffer on x86 and ARM.

  • I never get 11 on my PC (Core i3, x86_64), i.e. is 11 a valid output on x86 under ISO C++, or am I missing something? @Daniel Langr demonstrated that 11 is a valid output on x86.

  • Now x86_64 has an advantage: fetch_add acts as a full barrier.

  • For arm64, the output can sometimes be 00 due to CPU instruction reordering.

  • For arm64 or some other arch, can the output be 00 without reordering? My question is based on this: the value written by foo's a.fetch_add(1) sits in cpu0's store buffer and is not yet visible to bar's a.load(), and likewise bar's b.fetch_add(1) is not yet visible to foo's b.load(). Hence we would get 00 without any reordering. Can this happen under ISO C++ on different archs?

// g++ -O2 -pthread axbx.cpp  ; while [ true ]; do ./a.out  | grep "00" ; done
#include<cstdio>
#include<thread>
#include<atomic>
using namespace std;
atomic<int> a,b;
int reta,retb;

void foo(){
        a.fetch_add(1,memory_order_relaxed); //add to a is stored in store buffer of cpu0
        //a.store(1,memory_order_relaxed);
        retb=b.load(memory_order_relaxed);
}

void bar(){
        b.fetch_add(1,memory_order_relaxed); //add to b is stored in store buffer of cpu1
        //b.store(1,memory_order_relaxed);
        reta=a.load(memory_order_relaxed);
}

int main(){
        thread t[2]{ thread(foo),thread(bar) };
        t[0].join(); t[1].join();
        printf("%d%d\n",reta,retb);
        return 0;
}

Solution

  • Yes, ISO C++ allows this; as Daniel pointed out, one easy way is to put some slow stuff after the RMWs and before the loads, so the loads don't execute until both threads have had a chance to increment. This should be obvious because it doesn't require any run-time reordering to happen, just a simple interleaving of program order. (So ISO C++ allows 11 even with seq_cst.) i.e. you could demonstrate it by single-stepping each thread separately.
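
    For example, a minimal sketch of that delay-based approach, reusing the globals (and using namespace std) from the question's program; the loop bound of 100000 is an arbitrary guess, not a tuned value:

    // Sketch only: an artificial delay between the RMW and the load gives the other
    // thread time to do its increment first, so 11 becomes the overwhelmingly common case.
    void foo(){
        a.fetch_add(1, memory_order_relaxed);
        for (volatile int i = 0; i < 100000; ++i) {}   // crude delay; volatile keeps it from being optimized away
        retb = b.load(memory_order_relaxed);
    }

    void bar(){
        b.fetch_add(1, memory_order_relaxed);
        for (volatile int i = 0; i < 100000; ++i) {}   // same delay in the other thread
        reta = a.load(memory_order_relaxed);
    }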


    If you're wondering how to create a practical demonstration on x86 without delay loops:

    Try putting your atomic vars in separate cache lines so two different cores can be writing them in parallel.

    alignas(64) std::atomic<int> a, b;  // the alignas applies to each separately
    

    With them in the same cache line, which probably happens by default, the core that wins ownership of the cache line they're both in can execute the load of the other var as soon as the full-barrier part of the increment is done. Completing the RMW means that cache line is already hot in this core's L1d cache. (A core can only modify a cache line after gaining exclusive ownership of it via MESI, which also makes it valid for reads.)

    So both operations by one thread are extremely likely to happen before either operation by the other thread. (In the x86 asm, each operation has the same asm as its seq_cst equivalent would have had, so we can usefully talk about a global order of operations without losing anything.)

    Probably the only thing that would stop this from happening is an interrupt arriving at just the right moment, between the RMW and the load.
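
    If you'd rather not hard-code 64, C++17's std::hardware_destructive_interference_size (from <new>) reports a suitable alignment when your standard library provides it. A hedged sketch that would replace the question's declarations of a and b; the 64-byte fallback is an assumption that holds on typical x86 and many ARM cores:

    #include <atomic>
    #include <cstddef>
    #include <new>   // std::hardware_destructive_interference_size (C++17, optional)

    // Pick an alignment that keeps a and b out of each other's cache line.
    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t kLine = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t kLine = 64;   // assumption: typical cache-line size
    #endif

    alignas(kLine) std::atomic<int> a{0};
    alignas(kLine) std::atomic<int> b{0};

    Either way, running the binary in a loop and counting outcomes (e.g. piping repeated ./a.out runs through sort | uniq -c) shows how often each result occurs.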


    You also asked a separate question:

    can the output be 00 without reordering?

    Clearly no. No interleaving of program-order can put both loads before either increment, so either run-time or compile-time reordering is necessary to create the 00 effect.

        a.fetch_add(1,memory_order_relaxed);  // foo1
        retb=b.load(memory_order_relaxed);    // foo2
    
        b.fetch_add(1,memory_order_relaxed); // bar1
        reta=a.load(memory_order_relaxed);   // bar2
    

    Mix those however you want without putting foo2 before foo1 or bar2 before bar1.

    i.e. if you single-stepped each thread separately, you could never see 00. Of course the whole point of mo_relaxed is that it can reorder. Specifying "without reordering" is the same as saying "with seq_cst".
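
    To check that mechanically, here is a small self-contained sketch (not part of the original programs) that enumerates all six program-order-preserving interleavings and prints the outcome of each; 00 never appears, while 11 shows up in four of the six:

    #include <cstdio>

    // Enumerate the 6 interleavings of {foo1, foo2} with {bar1, bar2} that keep
    // program order within each thread, run each one sequentially on plain ints,
    // and report the resulting (reta, retb).
    int main(){
        // 'F' = next foo operation, 'B' = next bar operation.
        const char* interleavings[6] = {"FFBB", "FBFB", "FBBF", "BFFB", "BFBF", "BBFF"};
        for (const char* seq : interleavings) {
            int a = 0, b = 0, reta = -1, retb = -1;
            int fstep = 0, bstep = 0;            // progress of each thread
            for (int i = 0; i < 4; ++i) {
                if (seq[i] == 'F') {
                    if (fstep++ == 0) a += 1;    // foo1: a.fetch_add(1)
                    else              retb = b;  // foo2: retb = b.load()
                } else {
                    if (bstep++ == 0) b += 1;    // bar1: b.fetch_add(1)
                    else              reta = a;  // bar2: reta = a.load()
                }
            }
            std::printf("%s -> %d%d\n", seq, reta, retb);
        }
        return 0;
    }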

    The effects of a store buffer are a kind of reordering, specifically StoreLoad reordering. mo_seq_cst prevents even that, which is part of the point of seq_cst and what makes it so expensive, especially for pure-store operations.
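
    For contrast, a sketch of a seq_cst version of the question's foo/bar (same globals assumed): the single total order over all seq_cst operations means at least one thread must observe the other's increment, so 00 is impossible on every architecture under ISO C++, not just on x86 where the locked RMW already acts as a full barrier. 11 is of course still allowed, as the interleavings above show.

    // Sketch: seq_cst variant of the question's functions.  seq_cst forbids the
    // StoreLoad reordering a store buffer would otherwise allow, so each thread's
    // load can't complete before its own increment is globally visible.
    void foo(){
        a.fetch_add(1, memory_order_seq_cst);
        retb = b.load(memory_order_seq_cst);
    }

    void bar(){
        b.fetch_add(1, memory_order_seq_cst);
        reta = a.load(memory_order_seq_cst);
    }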