c++ x86 cpu-architecture memory-barriers memory-model

Demonstrate LoadStore reordering with a load getting a value round-tripped to another thread, in practice with relaxed load/store?

#include <atomic>
#include <thread>

void test_relaxed()
{
    using namespace std;
    atomic<int> x{0};
    atomic<int> y{0};

    std::thread t1([&] {
        auto r1 = y.load(memory_order_relaxed); //a
        x.store(r1, memory_order_relaxed); //b
    });

    std::thread t2([&] {
        auto r2 = x.load(memory_order_relaxed); //c
        y.store(42, memory_order_relaxed); //d
    });
    
    t1.join();
    t2.join();
}

According to cppreference (in a relaxed ordering example), the above code is allowed to produce r1 == r2 == 42.

But I have tested it on x86-64 and arm64 platforms and I cannot get this result. Is there any way to get it in practice with real compilers and CPUs?

(Godbolt)

Solution

According to the ARM Memory Tool (the article, the online tool), the arm64 memory model allows this behavior, which means it might occur on some arm64 CPUs.

The following is a litmus test for your example:

AArch64 SO-q-2023-03-06
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
 P0          | P1          ;
 LDR W0,[X1] | LDR W0,[X1] ;
 STR W0,[X3] | MOV W2,#42  ;
             | STR W2,[X3] ;
exists
(0:X0=42 /\ 1:X0=42)

You can try it yourself in the online tool.

However, it may be hard to find arm64 hardware that actually exhibits this behavior.

I have no data for arm64, but there is such data for ARMv7 (it might give you insight, or you might want to try to reproduce your example on ARMv7).
Your test case is very similar to the LB+data+po litmus test.
The results for that litmus test on different hardware are here: as you can see, the outcome is observed only on some hardware.
The meaning of the hardware abbreviations used in the table is given here.
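
If you want to hunt for the outcome on real hardware, a single run per thread pair is unlikely to catch it, since the two threads rarely overlap closely enough. Below is a minimal sketch of a repeated-trial harness, not a definitive litmus runner. The names `LbResult`, `run_lb_litmus`, and `wait_for_both` are hypothetical, and the design (fresh `x`/`y` locations per trial, plus a simple counting spin barrier so both threads start each trial at roughly the same time) is my own assumption about what makes the window easier to hit. A `reordered` count of zero does not prove the hardware forbids the outcome.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical result type: how many trials showed r1 == r2 == 42.
struct LbResult {
    std::size_t reordered = 0; // trials where r1 == 42 && r2 == 42
    std::size_t total = 0;     // number of trials run
};

// Hypothetical harness: runs the question's a/b and c/d pairs many times.
LbResult run_lb_litmus(std::size_t iterations)
{
    using namespace std;

    // Fresh locations for every trial, so no trial sees a stale 42.
    vector<atomic<int>> xs(iterations), ys(iterations);
    for (size_t i = 0; i < iterations; ++i) {
        xs[i].store(0, memory_order_relaxed);
        ys[i].store(0, memory_order_relaxed);
    }

    // Each thread writes only its own result vector; read after join().
    vector<int> r1(iterations, 0), r2(iterations, 0);

    // Counting barrier: trial i starts once both threads have arrived,
    // i.e. once the counter reaches 2 * (i + 1).
    atomic<size_t> arrived{0};
    auto wait_for_both = [&](size_t i) {
        arrived.fetch_add(1);
        while (arrived.load() < 2 * (i + 1))
            this_thread::yield(); // assumption: yield keeps this usable on few cores
    };

    thread t1([&] {
        for (size_t i = 0; i < iterations; ++i) {
            wait_for_both(i);
            r1[i] = ys[i].load(memory_order_relaxed); // a
            xs[i].store(r1[i], memory_order_relaxed); // b
        }
    });

    thread t2([&] {
        for (size_t i = 0; i < iterations; ++i) {
            wait_for_both(i);
            r2[i] = xs[i].load(memory_order_relaxed); // c
            ys[i].store(42, memory_order_relaxed);    // d
        }
    });

    t1.join();
    t2.join();

    LbResult result;
    result.total = iterations;
    for (size_t i = 0; i < iterations; ++i)
        if (r1[i] == 42 && r2[i] == 42)
            ++result.reordered;
    return result;
}
```

Calling `run_lb_litmus(1000000)` and printing `reordered` gives a rough count. On x86-64 the hardware does not reorder a load with a later store, so a nonzero count there would have to come from the compiler reordering c and d, which mainstream compilers do not usually do for atomics. Dedicated tools such as the litmus runner behind the ARMv7 results above control thread placement and timing much more carefully than this sketch does.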