In the code below, the write to a in foo sits in the store buffer and is not visible to the load of a into ra in bar. Similarly, the write to b in bar is not visible to the load of b into rb in foo, so they print 00.
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00"; done    prints 00 within 1 min
#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a, b;
long ra, rb;

void foo() {
    a.store(1, memory_order_relaxed);
    rb = b.load(memory_order_relaxed);
}

void bar() {
    b.store(1, memory_order_relaxed);
    ra = a.load(memory_order_relaxed);
}

int main() {
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    if ((ra == 0) && (rb == 0)) printf("00\n"); // each CPU's store-buffer write not yet visible to the other thread
}
The code below is almost the same as above, except that b is no longer used: both foo and bar store to and load from the same variable a, and the loaded values go into ra1 and ra2. In this case I never get a "00", even after running for 5 minutes.
// g++ -O2 -pthread axbx.cpp ; while [ true ]; do ./a.out | grep "00"; done    doesn't print 00 within 5 min
#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a;
long ra1, ra2;

void foo() {
    a.store(1, memory_order_relaxed);
    ra1 = a.load(memory_order_relaxed);
}

void bar() {
    a.store(1, memory_order_relaxed);
    ra2 = a.load(memory_order_relaxed);
}

int main() {
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    if ((ra1 == 0) && (ra2 == 0)) printf("00\n"); // each CPU's store-buffer write not yet visible to the other thread
}
a.store(1, mo_relaxed) is sequenced before a.load in the same thread (in both foo and bar), so both loads must see that store's value (or the value of some later store). That makes it impossible for either load to see the initial 0.

A thread always sees its own operations in program order, even when they're on atomic objects accessed with mo_relaxed. That's basically equivalent to making sure the stores and loads happen in the asm in program order, but without any extra barriers to prevent runtime reordering as observed by other threads, like if you'd used volatile. (But don't.) The cardinal rule of out-of-order execution is "don't break single-threaded code".
And BTW, you're actually correct that the value can be forwarded directly from the store buffer, before it hits L1d cache and becomes globally visible. (That's allowed because you didn't use mo_seq_cst; seq_cst loads can't happen until previous seq_cst stores are globally visible, e.g. on x86 it would have to compile to an xchg store, or mov + mfence. Semi-related: Globally Invisible load instructions, about where load results come from on x86, although the general point about store-forwarding applies to most mainstream CPUs, including most ARM.)
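To make the contrast concrete, here's a minimal sketch of a seq_cst variant of foo from the second program (the name foo_seq_cst and the expected codegen comments are illustrative, not part of the question); with mainstream compilers on x86, the seq_cst store becomes an xchg (or mov + mfence), so the following load can't take its value from a not-yet-committed store-buffer entry:

void foo_seq_cst() {
    a.store(1, memory_order_seq_cst);    // x86: typically xchg (or mov + mfence), drains the store buffer
    ra1 = a.load(memory_order_seq_cst);  // can't execute until the store above is globally visible
}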
So in practice the loads are very likely to see the 1 stored by their own thread, not the 1 from the other thread: the code compiles to asm that allows the store to forward to the load, and the load comes right after, so it's probably already executing and waiting for the store data to be forwarded before there's any window for the other thread's store to become visible in between (unless an interrupt arrives between the store and the load).

You could check by storing a 1 in one thread and a 2 in the other, for example, to see whether you always get 12 or sometimes get 21.
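A minimal sketch of that experiment, assuming foo stores 1, bar stores 2, and main prints both loaded values instead of checking for 00:

#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a;
long ra1, ra2;

void foo() {
    a.store(1, memory_order_relaxed);
    ra1 = a.load(memory_order_relaxed);   // almost always 1, forwarded from this thread's own store
}

void bar() {
    a.store(2, memory_order_relaxed);
    ra2 = a.load(memory_order_relaxed);   // almost always 2
}

int main() {
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    printf("%ld%ld\n", ra1, ra2);  // expect 12 nearly every run; anything else means a load observed the other thread's store
}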
Your analysis of why you can see 00 in your version using 2 variables is pretty sloppy:

In the below code the write to a in foo is stored in the store buffer and not visible to the load into ra in bar. Similarly the write to b in bar is not visible to the load into rb in foo.
Yes, the store buffer is the normal cause of StoreLoad reordering, and if both foo and bar happen to execute at nearly the same time, then yes, both loads can happen and grab the old values before either store can get itself committed to L1d cache. So if that does happen, then yes, it's because of the store buffer.

But the store buffer is always trying to drain itself as fast as possible and commit pending stores to L1d, where they're globally visible. That's why it's rare to actually see 00: usually one core will get exclusive ownership of the cache line and commit its store before the other core's load can run.
It's definitely not true that a write to a will be invisible to a load in another thread; it might or might not have become visible by the time that load executes.
(Semi-related: StoreLoad reordering is the "most important" one for performance, and the most expensive one to block. For example, x86 asm always blocks the other kinds (its memory model being program order plus a store buffer with store-forwarding), so reordering of 2 stores with respect to each other can only happen at compile time on x86. How does memory reordering help processors and compilers?)
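For reference, a hypothetical variant of the two-variable test with all four accesses promoted to seq_cst; blocking StoreLoad reordering this way is what rules out the 00 result, at the cost of the expensive barriers mentioned above:

#include <atomic>
#include <thread>
#include <cstdio>
using namespace std;

atomic<long> a, b;
long ra, rb;

void foo() {
    a.store(1, memory_order_seq_cst);   // x86: xchg (or mov + mfence), drains the store buffer
    rb = b.load(memory_order_seq_cst);
}

void bar() {
    b.store(1, memory_order_seq_cst);
    ra = a.load(memory_order_seq_cst);
}

int main() {
    thread t[2]{ thread(foo), thread(bar) };
    t[0].join(); t[1].join();
    if ((ra == 0) && (rb == 0)) printf("00\n");  // should never print: seq_cst forbids the 00 outcome
}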