Hoisting of non-atomic loads up through acquiring atomic loads

I was under the impression that memory loads could not be hoisted above an acquiring load in the C++11 memory model. However looking at the code that gcc 4.8 produces that only seems to be true for other atomic loads, not all of memory. If that's true and acquiring loads don't synchronize all memory (just std::atomics) then I'm not sure how it would be possible to implement general purpose mutexes in terms of std::atomic.

The following code:

extern std::atomic<unsigned> seq;
extern std::atomic<int> data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces:

_Z6readerv:
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov eax, DWORD PTR data[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

Which looks correct to me.

However changing data to be an int rather than std::atomic:

extern std::atomic<unsigned> seq;
extern int data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data;
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces this:

_Z6readerv:
    mov eax, DWORD PTR data[rip]
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

So what's going on?

Solution

Why a load was hoisted above an acquire

I've posted this on the gcc bugzilla and they've confirmed it as a bug.

the MEM alias-set of -1 (ALIAS_SET_MEMORY_BARRIER) is supposed to prevent this, but PRE does not know about this special property (it should "kill" all refs crossing it).

It looks like the gcc wiki has a nice page about this.

Generally, release is a barrier to sinking code, acquire is a barrier to hoisting code.

Why this code is still broken

As per this paper my code is still incorrect, because it introduces a data race. Even though the patched gcc generates the correct code, it's still not proper to access data without wrapping it in std::atomic. The reason is that data races are undefined behavior, even if computations resulting from them are discarded.

An example courtesy of AdamH.Peterson:

int foo(unsigned x) {
    if (x < 10) {
        /* some calculations that spill all the 
           registers so x has to be reloaded below */
        switch (x) {
        case 0:
            return 5;
        case 1:
            return 10;
        // ...
        case 9:
            return 43;
        }
    }
    return 0;
}

Here a compiler might optimise the switch into a jump table, and thanks to the if statement above would be able to avoid a range check. However if data races were not undefined behaviour then an second range check would be required.