The OP of a recent question added a comment to it linking a paper entitled *Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it*, which apparently was presented at POPL 2015. Among other things, it purports to show several unexpected and counterintuitive conclusions derived from the specifications for what it calls the "C11 memory model", which I take to consist largely of the provisions of section 5.1.2.4 of the C11 language specification.
The paper is somewhat lengthy, but for the purposes of this question I focus on the discussion of scheme "SEQ" on the second page. This concerns a multithreaded program in which ...
- `a` is non-atomic (for example, an `int`),
- `x` and `y` are atomic (for example, `_Atomic int`), and
- `a`, `x`, and `y` all initially have value 0,

... and the following occurs (transliterated from pseudocode):
Thread 1:

```c
a = 1;
```

Thread 2:

```c
if (atomic_load_explicit(&x, memory_order_relaxed))
    if (a)
        atomic_store_explicit(&y, 1, memory_order_relaxed);
```

Thread 3:

```c
if (atomic_load_explicit(&y, memory_order_relaxed))
    atomic_store_explicit(&x, 1, memory_order_relaxed);
```
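For concreteness, here is one way to render SEQ as a complete program. This harness is my own sketch, not the paper's code; it assumes an implementation that provides the optional C11 `<threads.h>` (substitute pthreads otherwise), and the function names and `main()` are my additions:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int a = 0;                /* non-atomic */
static _Atomic int x = 0, y = 0; /* atomic */

static int thread1(void *arg) {
    (void)arg;
    a = 1;                       /* non-atomic store */
    return 0;
}

static int thread2(void *arg) {
    (void)arg;
    if (atomic_load_explicit(&x, memory_order_relaxed))
        if (a)                   /* non-atomic load: conflicts with thread1's store */
            atomic_store_explicit(&y, 1, memory_order_relaxed);
    return 0;
}

static int thread3(void *arg) {
    (void)arg;
    if (atomic_load_explicit(&y, memory_order_relaxed))
        atomic_store_explicit(&x, 1, memory_order_relaxed);
    return 0;
}

int main(void) {
    thrd_t t1, t2, t3;
    thrd_create(&t1, thread1, NULL);
    thrd_create(&t2, thread2, NULL);
    thrd_create(&t3, thread3, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    thrd_join(t3, NULL);
    printf("a=%d x=%d y=%d\n", a, atomic_load(&x), atomic_load(&y));
    return 0;
}
```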
The paper makes this argument:
> First, notice that there is no execution (consistent execution in the terminology of Section 2) in which the load of `a` occurs. We show this by contradiction. Suppose that there is an execution in which a load of `a` occurs. In such an execution the load of `a` can only return 0 (the initial value of `a`) because the store `a=1` does not happen before it (because it is in a different thread that has not been synchronised with) and non-atomic loads must return the latest write that happens before them. Therefore, in this execution the store to `y` does not happen, which in turn means that the load of `y` cannot return 1 and the store to `x` also does not happen. Then, `x` cannot read 1, and thus the load of `a` does not occur. As a consequence this program is not racy: since the load of `a` does not occur in any execution, there are no executions with conflicting accesses on the same non-atomic variable. We conclude that the only possible final state is a=1 ∧ x=y=0.
(Question 1) But isn't that argument fatally flawed?
The assertion that the load of `a` can only return 0 is made subject to the assumption that `a` is in fact read, which the argument intends to contradict. But in that case, as the paper observes, there is no happens-before relationship between the store to `a` in thread 1 and the load from `a` in thread 2. These are conflicting accesses, neither is atomic, and one is a write. Therefore, per paragraph 5.1.2.4/25, the program contains a data race, resulting in undefined behavior. Because the behavior is undefined, nothing can be concluded about the value loaded from `a` by thread 2, and in particular, it cannot be concluded from the specification that the load must return 0. The rest of the argument then collapses.
Although the paper claims that the argument shows that the program does not contain a data race ("is not racy"), in fact that is not a consequence of the argument but rather a hidden assumption. Only if, contrary to 5.1.2.4/25, the program did not contain a data race would the argument stand up.
Now perhaps the key is that the argument above considers only "consistent executions", a term defined in a later section of the paper. I confess that it gets a little deep for me at that point, but if in fact constraining the behavior to consistent executions is sufficient to support the assertion that the load of `a` must return 0, then it seems that it is no longer (just) the rules of the C11 memory model that we are talking about.
This matters because the authors conclude that a source-to-source translation combining threads 1 & 2 to yield ...
Thread 2':

```c
a = 1;
if (atomic_load_explicit(&x, memory_order_relaxed))
    if (a)
        atomic_store_explicit(&y, 1, memory_order_relaxed);
```
... is not permitted of implementations by the C11 memory model, on the basis that it allows executions in which the final state is a = x = y = 1. That this and various other code transformations and optimizations are invalid is the thesis of the paper.
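To see why the transformed program admits that outcome, note that once threads 1 and 2 are merged, nothing stops a compiler from also forward-propagating the store to `a` into the inner condition. This further simplification is my own illustration, not taken from the paper:

```c
/* Thread 2' after a compiler forward-propagates a = 1, making the
   inner test `if (a)` trivially true: */
a = 1;
if (atomic_load_explicit(&x, memory_order_relaxed))
    atomic_store_explicit(&y, 1, memory_order_relaxed);
```

Either way, the two-thread program contains no data race (only thread 2' touches `a`), and the formal model then admits a consistent execution in which the relaxed load of `x` reads the 1 stored by thread 3 while thread 3's load of `y` reads the 1 stored by thread 2' -- a self-satisfying cycle yielding the final state a = x = y = 1.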
(Question 2) But isn't it indeed valid for a C11 implementation to treat the original three-threaded program as if it were the two-threaded program consisting of threads 2' and 3?
If that allows "consistent executions" with results that could not be produced by any consistent execution of the original program, then I think that just shows that C11 does not constrain programs to exhibiting consistent executions. Am I missing something here?
Apparently, no one is both interested enough and confident enough to write an answer, so I guess I'll go ahead.
> isn't that argument fatally flawed?
To the extent that the proof quoted from the paper is intended to demonstrate that a conforming C implementation is not permitted to perform the source-to-source transformation described in the question, or an equivalent, yes, the proof is flawed. The refutation presented in the question is sound.
There was some discussion in comments about how the refutation could be viewed as boiling down to anything being permissible in the event of undefined behavior. That is a valid perspective, and in no way does it undercut the argument. However, I think it's unnecessarily minimalistic.
Again, the key problem with the paper's proof is here:
> the load of `a` can only return 0 (the initial value of `a`) because the store `a=1` does not happen before it (because it is in a different thread that has not been synchronised with) and non-atomic loads must return the latest write that happens before them.
The proof's error is that the language specification's requirement that a read of `a` must return the result of a write to `a` that "happened before" it is conditioned on the program being free of data races. This is an essential foundation for the whole model, not some kind of escape hatch. The program manifestly is not free of data races if in fact the read of `a` is performed, so the requirement is moot in that case. The read of `a` by thread 2 absolutely can observe the write by thread 1, and there is good reason to suppose that it might sometimes do so in practice.
To look at it another way, the proof chooses to focus on the write not happening before the read, but ignores the fact that the read also does not happen before the write.
Taking the relaxed atomic accesses into account does not change anything. It is plausible that in a real execution of the paper's three-threaded program, the implementation (for example) speculatively executes the relaxed load of `x` in thread 2 on the assumption that it will return 1, then reads from `a` the value written by thread 1, and as a result executes the store to `y`. Because the atomic accesses are performed with relaxed semantics, the execution of thread 3 can read the value of `y` as 1 (or speculate that it will do so) and consequently perform the write to `x`. All speculations involved can then be confirmed correct, with the final result that a = x = y = 1. It is intentional that this seemingly paradoxical result is allowed by the "relaxed" memory order.
> isn't it indeed valid for a C11 implementation to treat the original three-threaded program as if it were the two-threaded program consisting of threads 2' and 3?
At minimum, the paper's argument does not show otherwise, even if we -- with no basis in the specification -- construe the scope of the UB arising from the data race to be limited to whether the value read from `a` is its initial one or the one written by thread 1.
Implementations are given broad license to behave as they choose, so long as they produce observable behavior that is consistent with the behavior required of the abstract machine. The creation and execution of multiple threads is not itself part of the observable behavior of a program as the specification defines it. Therefore, yes, an implementation that performed the proposed transformation and then behaved accordingly, or one that otherwise behaved as if there were a happens-before edge between the write to `a` and the read from `a`, would not be acting inconsistently with the specification.
Following the recent alternative answer, and in view of the commentary on it, it seems appropriate to include some remarks here about how that answer and this one come to different conclusions.
Main controversy
The other answer summarizes this key characteristic of the paper's analysis:
> So the check for a "race" (and therefore "undefined behaviour") is done only on the "consistent executions", not on "candidate executions".
The other answer seems to subscribe to that approach, apparently accepting that filtering out candidate executions without regard for the impact of UB is valid. It does not provide a general justification for that, but in the case of the example program, it claims that this is justified by the spec's rule for which side effects are visible to which evaluations (5.1.2.4/19).
Objection 1: specifications of UB cannot be disregarded
The rule cited by the other answer does not express any precedence over the one that indicates that the program's behavior is undefined, nor does it contradict the definition holding that the spec places no requirements on a program with UB.
In practice, the well-established convention for interpreting the spec with respect to UB is that UB indeed has the consequences its definition says it has. No provisions of the spec that otherwise would require particular program behavior apply to programs with UB. The existence of rules specifying program behavior does not provide any justification for disregarding rules that explicitly specify program behavior to be undefined. It is in fact the reverse.
Side consideration: proving UB
The matter arose in discussion of how one can then prove that a program has UB at all. If one is reasoning about a program based on the spec, and arrives at a provision such as 5.1.2.4/25 that specifies the program to have UB, then does that not undermine the chain of reasoning by which that conclusion was reached?
Yes and no. We could observe that undertaking such a chain of reasoning in the first place necessarily starts with an assumption that the program has defined behavior. If that allows one to conclude that the program has UB according to some explicit specification, then that contradicts the initial assumption, proving UB by contradiction, if not directly.
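Schematically (my own shorthand, not the spec's): write D(P) for "program P has defined behavior". The chain of reasoning has the shape

D(P) ⇒ (5.1.2.4/25 applies to P) ⇒ ¬D(P)

and an assumption that refutes itself is thereby refuted, so ¬D(P) holds outright.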
Alternatively, it is also possible to rely on paragraph 4/2, which says, among other things,
> Undefined behavior is indicated [by] the omission of any explicit definition of behavior
Therefore, even if one rejects the validity of the proof by contradiction, one is left with UB in such cases simply on the basis that the spec does not provide any coherent definition of the program's behavior.
Objection 2: "consistent executions" are not the only ones allowed by the C language spec
Consider this simpler program, in which `int`s `a` and `b` are initially 0:

Thread 1:

```c
a = 1;
```

Thread 2:

```c
b = a;
```
Following the paper's approach, we exclude from the list of consistent executions any in which thread 2's read of `a` observes a value other than 0, and any in which the final value of `a` is not 1. That leaves just one consistent execution, but when that execution is analyzed for data races, we find one involving `a`. What now?
The language spec places no requirements on the behavior of a program containing a data race, so we must reconsider our previous exclusions: all candidate executions are allowed (that is, they are consistent with the language spec), even those that are not consistent per the paper.
But what then is the significance of the paper's kind of consistent executions? And if executions that are not consistent are nevertheless allowed in some cases, then on what basis do we exclude all such executions from an analysis of whether the program contains data races? In fact, we do want to exclude many candidate executions from that analysis, but I submit that we cannot justify excluding those for which doing so requires assuming absence of a data race in the first place.
Objection 3: the paper's interpretation is inconsistent with real-world expectations and behavior
The paper and the other answer suppose that because the non-atomic store to `a` in thread 1 of the example program has no happens-before relationship with the load from `a` in thread 2, the implementation is required to prevent the load from observing the store. That's manifestly untrue, as the previous objection shows, but it's not even a reasonable expectation.
Actual practice and general expectation is that non-atomic accesses follow the natural storage behavior of the machine, not requiring any special attention above and beyond that, including in multi-threaded programs. In a program without data races, the required memory behavior, including for non-atomic accesses, is expected to be assured by correct implementation of atomic operations and other synchronization operations, without any extra requirements on the handling of non-atomic accesses. That is the underlying basis for 5.1.2.4/19, for example.
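For instance, a runnable rendering of the simpler two-thread program from Objection 2 (the harness and names are my own) will, on typical implementations, sometimes print b=1, precisely because nothing in the generated code prevents the racy read from observing the racy write:

```c
#include <stdio.h>
#include <threads.h>

static int a, b; /* both initially 0; the accesses below constitute a data race */

static int writer(void *arg) { (void)arg; a = 1; return 0; }
static int reader(void *arg) { (void)arg; b = a; return 0; }

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, writer, NULL);
    thrd_create(&t2, reader, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    printf("b=%d\n", b); /* sometimes 0, sometimes 1, in practice */
    return 0;
}
```

The program has undefined behavior per 5.1.2.4/25, of course; the point is that implementations meet the racy case by doing nothing special, not by somehow forcing the read to return 0.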
The point of attributing undefined behavior to data races is exactly to relieve implementations from satisfying those provisions in a program with data races. (No doubt the intent was specific to accesses to the racy object, but the committee prefers to use existing tools such as UB, and sometimes paints with a broad brush.) This is the perspective from which I'm speaking when I say that being race-free is foundational, or that the other provisions are contingent on that. Yes, they are contingent in a technical sense, because everything is contingent on the program having defined behavior, but that is the means by which the committee met its objective of freeing implementations from obligation in the presence of data races.
It is therefore wrong to apply 5.1.2.4/19 or other rules of the memory model without first considering whether there are data races. The paper inverts that relationship.
Objection 4: the paper thereby creates its own problem
At least some of the deficiencies the paper's authors claim to have found in the C memory model depend on their model's approach to data races. A common theme is that programs containing data races are claimed to afford fewer consistent executions than similar ones without, and that this renders invalid various source-to-source translations that the authors claim (and I agree) were meant to be valid.
But those are manufactured problems, resolvable by correcting the model so that it integrates handling of data races in a way that leads to predictions matching the allowable behaviors that we all seem to agree were intended. Agreement with our expectations is among the primary criteria by which we should judge a formal model such as the paper's. A model corrected along these lines would be messier to formalize, and I anticipate that it would be weaker, but that just mirrors reality: the C memory model is weaker than the paper's formalization of it.
Closing example of a SEQ execution
There was some discussion in conjunction with the other question of an example execution of program SEQ that is allowed (or should be), and yields final state a=1 ∧ x=y=1. Here is one:
| Thread 1 | Thread 2 | Thread 3 |
|---|---|---|
| 1.1 store 1 in `a` (non-atomic) | 2.1 load 1 from `x` (relaxed)<br>2.2 load 1 from `a` (non-atomic)<br>2.3 store 1 in `y` (relaxed) | 3.1 load 1 from `y` (relaxed)<br>3.2 store 1 in `x` (relaxed) |
The full list of happens-before relationships here is:

- 2.1 → 2.2 → 2.3 (sequenced before, within thread 2)
- 3.1 → 3.2 (sequenced before, within thread 3)
- 2.3 → 3.1 (3.1 reads the value stored by 2.3)
- 3.2 → 2.1 (2.1 reads the value stored by 3.2)
I note that there is a happens-before cycle, but the memory model does not forbid it. Each atomic read maps to a corresponding atomic write that happens before it and writes the value that is read. There is a conflicting read and write on `a` with no happens-before relationship relating them, so (and this is where we diverge from the paper) the program behavior is undefined, and therefore 2.2 is allowed (as is anything and everything else).