Tags: c++, atomic, apple-m1, arm64, memory-barriers

C++ memory order not working on Apple M1 chip: reordering happens even with seq_cst in a StoreStore / LoadLoad litmus test?


With the seq_cst memory order, the following code should never end up with v1 == 0 and v2 == 2. But it still prints "Reorder happened" on my Apple M1 chip. I really don't know why.

#include <semaphore.h>
#include <cstdio>
#include <atomic>
#include <thread>

std::atomic<int> v1, v2;
sem_t start_1, start_2, complete;

int main() {
    sem_init(&start_1, 0, 0);
    sem_init(&start_2, 0, 0);
    sem_init(&complete, 0, 0);

    std::thread t1([&] {
        while (true) {
            sem_wait(&start_1);

            v1.store(1, std::memory_order_seq_cst);
            asm volatile("":: : "memory");
            v2.store(2, std::memory_order_seq_cst);

            sem_post(&complete);
        }
    });

    std::thread t2([&] {
        while (true) {
            sem_wait(&start_2);

            int val1 = v1.load(std::memory_order_seq_cst);
            asm volatile("":: : "memory");
            int val2 = v2.load(std::memory_order_seq_cst);

            if (val1 == 0 && val2 == 2) {
                puts("Reorder happened");
            }

            sem_post(&complete);
        }
    });

    for (int i = 0; i < 1000000; i++) {
        v1 = v2 = 0;
        sem_post(&start_1);
        sem_post(&start_2);

        sem_wait(&complete);
        sem_wait(&complete);
    }

    t1.detach();
    t2.detach();
    return 0;
}

Update:

Thanks a lot to Peter Cordes's answer; I had missed one of the interleaving possibilities. But I'm still confused by the following code. Again, with seq_cst order, I think r1 == 0 and r2 == 0 shouldn't happen. It works as expected on my x86 Intel machine, but not on my Apple M1 chip.

#include <thread>
#include <semaphore.h>
#include <cstdio>
#include <atomic>

std::atomic<int> v1, v2;
std::atomic<int> r1, r2;
sem_t start_1, start_2, complete;

int main() {
    sem_init(&start_1, 0, 0);
    sem_init(&start_2, 0, 0);
    sem_init(&complete, 0, 0);

    std::thread t1([&] {
        while (true) {
            sem_wait(&start_1);

            v1.store(1, std::memory_order_seq_cst);
            asm volatile("":: :"memory");
            int val = v2.load(std::memory_order_seq_cst);
            asm volatile("":: :"memory");
            r1.store(val, std::memory_order_seq_cst);

            sem_post(&complete);
        }
    });

    std::thread t2([&] {
        while (true) {
            sem_wait(&start_2);

            v2.store(1, std::memory_order_seq_cst);
            asm volatile("":: :"memory");
            int val = v1.load(std::memory_order_seq_cst);
            asm volatile("":: :"memory");
            r2.store(val, std::memory_order_seq_cst);

            sem_post(&complete);
        }
    });

    for (int i = 0; i < 1000000; i++) {
        v1 = v2 = 0;
        sem_post(&start_1);
        sem_post(&start_2);

        sem_wait(&complete);
        sem_wait(&complete);

        if (r1 == 0 && r2 == 0) {
            printf("reorder detected @ %d\n", i);
        }
    }

    t1.detach();
    t2.detach();
    return 0;
}

Solution

  • Those values are explainable with seq_cst's interleaving of program-order between threads.

      T2 (reader)              T1 (writer)
      v1.load gets 0
                               both stores in order
      v2.load gets 2
    

    A correct litmus test for StoreStore or LoadLoad would need to load in the opposite order from stores and see 0 in the second load result, non-zero in the first.
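    As a concrete sketch (my rearrangement, not code from the question or the original answer), the reader thread from the first program could be changed like this, loading in the opposite order from the writer's stores. With seq_cst the message should then never print; with relaxed it would be allowed, though (as noted below) you still probably wouldn't observe it:

        // Hypothetical replacement for t2 in the first program:
        // t1 stores v1 first, then v2, so load v2 first, then v1.
        std::thread t2([&] {
            while (true) {
                sem_wait(&start_2);

                int val2 = v2.load(std::memory_order_seq_cst);
                int val1 = v1.load(std::memory_order_seq_cst);

                // Seeing v2's new value while v1 still reads 0 can't be
                // produced by any simple interleaving of the two threads;
                // it would require real LoadLoad or StoreStore reordering.
                if (val2 == 2 && val1 == 0) {
                    puts("Reorder happened");
                }

                sem_post(&complete);
            }
        });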

    In general, always try to make sure you've ruled out simple interleaving explanations when constructing a reordering litmus test. Think about what would happen if you stopped in a debugger and single-stepped one thread arbitrarily far while others were stopped. Or any combination of progress vs. stall between threads.


    Even after just switching the load order, you probably won't see such reordering in practice, even with relaxed. Both variables probably end up in the same cache line, so the CPU would have no reason to reorder the loads, especially since both load addresses are ready at the same time. See C++ atomic variable memory order problem can not reproduce LoadStore reordering example for some suggestions on real-world demos of memory-reordering effects allowed by C++ and the ISA you're targeting.
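    As a hedged illustration of that point (the struct name and the 128-byte figure are my assumptions, not from the answer), padding the variables onto separate cache lines is the usual first step if you want to give the hardware any chance to treat them independently:

        // Hypothetical padding: keep each atomic in its own cache line.
        // 64 bytes is a common line size; Apple M1 reportedly uses 128-byte
        // lines, so 128 is the safer choice there.
        struct alignas(128) PaddedAtomicInt {
            std::atomic<int> value{0};
        };

        PaddedAtomicInt v1, v2;   // would replace the plain globals in the
                                  // question; accesses become v1.value.load(...)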

    Related not-quite-duplicate: Can't get c++'s seq_cst memory model to work - another case of a poorly-constructed litmus test where the supposedly-interesting result can happen with seq_cst. That was for IRIW reordering.


    BTW, asm volatile("":: : "memory"); isn't necessary; seq_cst already forbids compile-time reordering of seq_cst operations with each other. I assume you just put it in to be extra sure once you found a surprising result.


    I'm only answering the original question (the first half of the current question). See Nate's answer regarding the second half, which should really have been posted separately. It looks like a correctly designed litmus test for a different effect (StoreLoad reordering, which can happen even on x86 with orders weaker than seq_cst), and the answer / explanation for it is unrelated.

    One interesting difference between AArch64 and x86 is that AArch64 seq_cst stores are about as weak as ISO C++ allows: they use the same machine instruction as release stores, stlr, and can reorder with later operations other than release stores and seq_cst loads/RMWs.

    StoreLoad ordering wrt. later seq_cst loads is done by ldar having a special interaction with earlier stlr, waiting for them before taking a value from cache. But it can reorder with earlier plain stores and plain loads.
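    For concreteness, here is a sketch of what mainstream compilers typically emit for these operations on AArch64; the asm in the comments is the usual clang/GCC output pattern, shown for illustration rather than quoted from the answer:

        #include <atomic>

        void writer(std::atomic<int>& x) {
            // seq_cst store: same instruction as a release store.
            x.store(1, std::memory_order_seq_cst);    // mov wN, #1 ; stlr wN, [x0]
        }

        int reader(std::atomic<int>& x) {
            // seq_cst load: ldar, which also waits for earlier stlr stores.
            return x.load(std::memory_order_seq_cst); // ldar w0, [x0]
        }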

    This is unlike many ISAs, including x86, where the only option that's strong enough is a full barrier somewhere, preventing all StoreLoad reordering. Since cheap loads are more important than cheap stores, the usual design is to have stores pay the price, which makes them slow even if there are no seq_cst loads any time soon (or at all) in the same thread. For example, x86 compilers normally use xchg mem, reg for seq_cst stores, which has an implicit lock prefix.
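    For comparison, a sketch of the typical x86-64 pattern, where the store carries the full-barrier cost and the load stays cheap (again illustrative on my part, not output quoted in the answer):

        #include <atomic>

        void writer(std::atomic<int>& x) {
            // seq_cst store: xchg is implicitly locked, i.e. a full barrier.
            x.store(1, std::memory_order_seq_cst);    // mov eax, 1 ; xchg [rdi], eax
        }

        int reader(std::atomic<int>& x) {
            // seq_cst load: a plain mov is already strong enough on x86.
            return x.load(std::memory_order_seq_cst); // mov eax, [rdi]
        }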

    But as Nate explains, there's nothing subtle about why your litmus test fails on AArch64 macOS.