c++ compilation language-lawyer compiler-optimization

Is there a data race between writes that nominally happen under mutually exclusive conditions?

Consider the following code snippet:

// Main thread
int non_atom = 0;
bool x = /* some value not known statically; depends on run-time arguments and complex calculations */;

// Thread 1 and 2 are started

// Thread 1
if (!x) {
    non_atom = 5; // only write to non_atom if not x
}

// Thread 2
int bar = 0;
if (x) {
    bar = non_atom; // only access non_atom if x
}

At run time, only one of T1 and T2 will access non_atom, so there should be no data race. But let's say the compiler decides to invert the if-condition in T2 like this:

int bar = non_atom;
if (!x) {
    bar = 0;
}

This isn't simply a reordering but rather an inversion of the conditional and its effects. There is now a data race, because if x == false both threads access non_atom without synchronization: T1 writes it and T2 reads it.

Is the compiler allowed to do this? Let's assume that the compiler cannot statically prove that only one thread accesses non_atom, for example because the code in T1 and T2 is in functions in separate translation units. Is the compiler allowed to introduce a data race where, dynamically, there wasn't one before? Or rather, are compilers careful enough not to do such a transformation even if no atomics are involved anywhere? If I encounter such a situation in the real world, should I use an atomic instead just to be safe?

This question is similar to this older question: Data race guarded by if (false)... what does the standard say?

However, in that older question the condition was statically always false, so it was clear that the guarded code should never be executed. In this question, the value is not known to the compiler at compile time, and the optimization is not a simple reordering.

This question was inspired by a post by Russ Cox (ctrl-f "Note that all these optimizations") on the Go Memory Model, in which he claims that such transformations ARE allowed for C++ compilers, in contrast to Go's.

Solution

The compiler is not allowed to transform the code that way at the source level, in the sense that it may not treat the two programs as equivalent. They are not equivalent because one has undefined behavior (a data race when x is false) while the other doesn't.

However, the compiler can of course make use of knowledge about the variables and the architecture to implement the function in non-straightforward ways in machine instructions.

For example, the compiler knows that non_atom has static storage duration and therefore the memory reserved for non_atom is still accessible without possibly causing a fault, regardless of the value of x.
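To illustrate why the storage duration matters, here is a minimal sketch contrasting the static variable with a pointer-based variant. The names read_static, read_through and check_reads are hypothetical and not from the question. With the pointer, a compiler generally cannot assume the load is harmless when x is false, because dereferencing a null or dangling pointer could fault.

```cpp
#include <cassert>

int non_atom = 0;  // static storage duration: always valid memory to load from

// Load from the static variable. Hoisting this load above the branch
// cannot fault, whatever the value of x is.
int read_static(bool x) {
    int bar = 0;
    if (x) {
        bar = non_atom;
    }
    return bar;
}

// Hypothetical variant: the load is only known to be safe when x is true.
int read_through(bool x, const int* p) {
    int bar = 0;
    if (x) {
        bar = *p;  // p may be null when x is false, so the load stays guarded
    }
    return bar;
}

// Small usage check (single-threaded).
void check_reads() {
    int value = 7;
    assert(read_static(false) == 0);
    assert(read_through(false, nullptr) == 0);  // no dereference happens
    assert(read_through(true, &value) == 7);
}
```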

Then, the only problems with performing an extra load of non_atom would generally be that it may read a torn value or that it may not see the value written by the other thread. Typically, nothing else can happen. But neither of these can affect the result if the original program doesn't execute a path with UB (a data race): when x is false the loaded value is discarded, and when x is true nothing writes non_atom concurrently. So doing the load independently of the value of x is not a problem.

So the compiler can do the load unconditionally in the machine code. It doesn't affect the observable behavior of the program on any path that has defined behavior. Whether compilers actually do that is a different question.
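As a rough sketch of that argument, the two shapes of T2's code can be written as functions and compared. The function names are hypothetical, and t2_speculative only models what the generated machine code might look like. Written as real C++ source, its unconditional read would be a data race if it ran concurrently with T1's write when x is false, so the check below is deliberately single-threaded.

```cpp
#include <cassert>

// As written in the question: non_atom is only read when x is true.
int t2_as_written(bool x, const int& non_atom) {
    int bar = 0;
    if (x) {
        bar = non_atom;
    }
    return bar;
}

// Model of the speculative form: load first, then discard if !x.
int t2_speculative(bool x, const int& non_atom) {
    int bar = non_atom;  // unconditional load
    if (!x) {
        bar = 0;         // the loaded value is thrown away
    }
    return bar;
}

// Both forms agree for every x and every stored value tried here.
void check_same_result() {
    const int values[] = {-1, 0, 5};
    for (int value : values) {
        for (int i = 0; i < 2; ++i) {
            const bool x = (i == 1);
            assert(t2_as_written(x, value) == t2_speculative(x, value));
        }
    }
}
```

When x is false, the result never depends on what was loaded, which is why a torn or stale value cannot change the outcome on the paths that are free of UB.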

One exception to the above that I am aware of is floating-point values on x86 with floating-point exceptions enabled and the legacy x87 instructions in use (see comments below this answer). In that case, loading the value unconditionally may raise a floating-point exception, e.g. if the value read when x is false is a signaling NaN, and the resulting observable behavior would be incorrect.

Actually, this is an open problem at least in GCC (but I would guess also in other compilers), since it doesn't take this possibility into account when transforming to such a speculative load. It can therefore cause a floating-point exception through the speculative load even though the program doesn't execute any path with a data race or other UB at the C++ level. See this bug report. That's technically not standard-conforming (if the configuration with exceptions on signaling NaNs is intended to be so). But as the comments say, it is hard to handle correctly during optimization.

Also, you of course need to make sure that any modification of x is synchronized with the accesses in the two threads shown, e.g. by writing x before these threads are started. Otherwise your program does have undefined behavior due to a data race, but on x, not on non_atom.
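A minimal sketch of the "write x before the threads are started" arrangement might look like the following. The helper names run_once and check_run_once are hypothetical, and non_atom is a local here (rather than a variable with static storage duration) only to keep the example self-contained.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: runs both threads once and returns {non_atom, bar}.
std::pair<int, int> run_once(bool x_value) {
    int non_atom = 0;
    int bar = 0;
    const bool x = x_value;  // x gets its final value before any thread starts

    // Completion of the std::thread constructor synchronizes with the start of
    // the thread function, so both threads see the value of x written above.
    std::thread t1([&] {
        if (!x) {
            non_atom = 5;  // only write to non_atom if not x
        }
    });
    std::thread t2([&] {
        if (x) {
            bar = non_atom;  // only read non_atom if x
        }
    });

    // Thread completion synchronizes with join(), so reading the results is safe.
    t1.join();
    t2.join();
    return {non_atom, bar};
}

// Both outcomes are deterministic: at most one thread touches non_atom.
void check_run_once() {
    assert(run_once(true) == std::make_pair(0, 0));   // T2 reads the initial 0
    assert(run_once(false) == std::make_pair(5, 0));  // T1 writes 5, T2 reads nothing
}
```

Because x is never modified after the threads start, the concurrent reads of x are not a data race, and no atomic is needed for either x or non_atom in this arrangement.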