How do fences actually work in C++?

I've been struggling with understanding how fences actually force code to synchronize.

For instance, say I have this code:

#include <atomic>
#include <cassert>
#include <thread>

bool x = false;
std::atomic<bool> y;
std::atomic<int> z;
void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true, std::memory_order_relaxed);
}
void read_y_then_x()
{
    while (!y.load(std::memory_order_relaxed));
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}
int main()
{
    x = false;
    y = false;
    z = 0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load() != 0);
}

Because the release fence is followed by an atomic store and the acquire fence is preceded by an atomic load, everything synchronizes as it's supposed to and the assert won't fire.

But if y were not an atomic variable, like this:

bool x;
bool y;
std::atomic<int> z;
void write_x_then_y()
{
    x = true;
    std::atomic_thread_fence(std::memory_order_release);
    y = true;
}
void read_y_then_x()
{
    while (!y);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (x)
        ++z;
}

then, I hear, there might be a data race. But why is that? Why must release fences be followed by an atomic store, and acquire fences be preceded by an atomic load in order for the code to synchronize properly?

I would also appreciate it if anyone could provide an execution scenario in which a data race causes the assert to fire.

Solution

The data race in your second snippet would not be a practical problem by itself: the snippet would work if the compiler translated the source literally into machine code.

But the compiler is free to generate any machine code that behaves the same as the original in a single-threaded program (the "as-if" rule).

For example, the compiler can notice that y does not change inside the while(!y) loop, so it can load the variable into a register once and test only that register on later iterations. If y is initially false, you get an infinite loop.
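To make the hoisting concrete, here is a small single-threaded sketch of what the transformed loop amounts to. It is not real compiler output: the names y_plain, spin_with_hoisted_load and writer_fires_at are made up for illustration, the "other thread" is simulated by a write inside the loop, and the iteration cap exists only so the example terminates.

```cpp
#include <cassert>

// Hypothetical stand-in for the non-atomic flag y from the question.
bool y_plain = false;

// Models the loop after the load of the flag has been hoisted out of it.
// writer_fires_at simulates the iteration at which another thread sets the flag.
// Returns the iteration at which the loop exits, or max_iterations if it never does.
int spin_with_hoisted_load(int max_iterations, int writer_fires_at) {
    bool cached = y_plain;              // the one and only read of the flag
    for (int i = 0; i < max_iterations; ++i) {
        if (i == writer_fires_at) {
            y_plain = true;             // simulated write by the "other thread"
        }
        if (cached) {                   // tests the stale copy, never re-reads y_plain
            return i;
        }
    }
    return max_iterations;              // the write was never observed
}
```

Starting with y_plain == false, a call such as spin_with_hoisted_load(1000, 10) runs out its whole iteration budget even though the flag is set at iteration 10, which is the bounded equivalent of the infinite loop described above.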

Another possible optimization is to remove the while(!y) loop entirely, because it contains no accesses to volatile or atomic variables and performs no synchronization operations. (The C++ Standard lets the implementation assume that every thread eventually performs one of those actions, terminates, or calls a library I/O function, so the compiler may rely on that when optimizing.)

And so on.

More generally, the C++ Standard says that concurrent accesses to a non-atomic variable, at least one of which is a write, form a data race, and a data race is Undefined Behavior, which amounts to "warranty void". That is why y must be atomic.

On the other hand, x doesn't need to be atomic: the fences, together with the atomic store to and load from y, order the write to x before the read of x, so those accesses are not concurrent.
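As a compact version of the working pattern, the sketch below publishes a plain int through a relaxed atomic flag guarded by fences. The names publish_with_fences, payload and ready are invented for this example (payload plays the role of x, ready the role of y), and the yield in the spin loop is only a courtesy to the scheduler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns 1 if the reader saw the published value after observing the flag, else 0.
int publish_with_fences() {
    int payload = 0;                    // plain, non-atomic data (like x)
    std::atomic<bool> ready{false};     // atomic flag (like y)
    int seen = 0;                       // written by the reader, read after join()

    std::thread writer([&] {
        payload = 42;                                           // plain write
        std::atomic_thread_fence(std::memory_order_release);    // release fence ...
        ready.store(true, std::memory_order_relaxed);           // ... then atomic store
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_relaxed)) {        // atomic load ...
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);    // ... then acquire fence
        seen = (payload == 42) ? 1 : 0;                         // plain read, ordered after the write
    });

    writer.join();
    reader.join();
    return seen;
}
```

Because the relaxed load reads the value written by the relaxed store, the release fence synchronizes with the acquire fence, so publish_with_fences() should return 1 on every run. Replacing ready with a plain bool would remove that guarantee and reintroduce the data race.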