c++ multithreading atomic memory-barriers

Can atomics suffer spurious stores?

In C++, can atomics suffer spurious stores?

For example, suppose that m and n are atomics and that m = 5 initially. In thread 1,

    m += 2;

In thread 2,

    n = m;

Result: the final value of n should be either 5 or 7, right? But could it spuriously be 6? Could it spuriously be 4 or 8, or even something else?

In other words, does the C++ memory model forbid thread 1 from behaving as though it did this?

    ++m;
    ++m;

Or, more weirdly, as though it did this?

    tmp  = m;
    m    = 4;
    tmp += 2;
    m    = tmp;

Reference: H.-J. Boehm & S. V. Adve, 2008, Figure 1. (If you follow the link, then, in the paper's section 1, see the first bulleted item: "The informal specifications provided by ...")

THE QUESTION IN ALTERNATE FORM

One answer (appreciated) shows that the question above can be misunderstood. If helpful, then here is the question in alternate form.

Suppose that the programmer tried to tell thread 1 to skip the operation:

    bool a = false;
    if (a) m += 2;

Does the C++ memory model forbid thread 1 from behaving, at run time, as though it did this?

    m += 2; // speculatively alter m
    m -= 2; // oops, should not have altered! reverse the alteration

I ask because Boehm and Adve, linked earlier, seem to explain that a multithreaded execution can speculatively alter a variable, but then later change the variable back to its original value when the speculative alteration turns out to have been unnecessary.

COMPILABLE SAMPLE CODE

Here is some code you can actually compile, if you wish.

#include <iostream>
#include <atomic>
#include <thread>

// For the original question, do_alter = true.
// For the question in alternate form, do_alter = false.
constexpr bool do_alter = true;

void f1(std::atomic_int *const p, const bool do_alter_)
{
    if (do_alter_) p->fetch_add(2, std::memory_order_relaxed);
}

void f2(const std::atomic_int *const p, std::atomic_int *const q)
{
    q->store(
        p->load(std::memory_order_relaxed),
        std::memory_order_relaxed
    );
}

int main()
{
    std::atomic_int m(5);
    std::atomic_int n(0);
    std::thread t1(f1, &m, do_alter);
    std::thread t2(f2, &m, &n);
    t2.join();
    t1.join();
    std::cout << n << "\n";
    return 0;
}

This code always prints 5 or 7 when I run it. (In fact, as far as I can tell, it always prints 7 when I run it.) However, I see nothing in the semantics that would prevent it from printing 6, 4 or 8.

The excellent Cppreference.com states, "Atomic objects are free of data races," which is nice, but in such a context as this, what does it mean?

Undoubtedly, all this means that I do not understand the semantics very well. Any illumination you can shed on the question would be appreciated.

ANSWERS

@Christophe, @ZalmanStern and @BenVoigt each illuminate the question with skill. Their answers cooperate rather than compete. In my opinion, readers should heed all three answers: @Christophe first; @ZalmanStern second; and @BenVoigt last to sum up.

Solution

The existing answers provide a lot of good explanation, but they fail to give a direct answer to your question. Here we go:

can atomics suffer spurious stores?

Yes, but you cannot observe them from a C++ program which is free from data races.

Only for volatile objects is the implementation actually prohibited from performing extra memory accesses.

does the C++ memory model forbid thread 1 from behaving as though it did this?

    ++m;
    ++m;

Yes, but this one is allowed:

lock (shared_std_atomic_secret_lock)
{
    ++m;
    ++m;
}

It's allowed but stupid. A more realistic possibility is turning this:

std::atomic<int64_t> m;
++m;

into

memory_bus_lock
{
    ++m.low;
    if (last_operation_did_carry)
       ++m.high;
}

where memory_bus_lock and last_operation_did_carry are features of the hardware platform that can't be expressed in portable C++.

Note that peripherals sitting on the memory bus do see the intermediate value, but can interpret this situation correctly by looking at the memory bus lock. Software debuggers won't be able to see the intermediate value.

In other cases, atomic operations can be implemented by software locks, in which case:

- Software debuggers can see intermediate values, and have to be aware of the software lock to avoid misinterpretation.
- Hardware peripherals will see changes to the software lock, and intermediate values of the atomic object. Some magic may be required for the peripheral to recognize the relationship between the two.
- If the atomic object is in shared memory, other processes can see the intermediate values and may not have any way to inspect the software lock, or may have a separate copy of said software lock.
- If other threads in the same C++ program break type safety in a way that causes a data race (for example, using memcpy to read the atomic object), they can observe intermediate values. Formally, that's undefined behavior.

One last important point. The "speculative write" is a very complex scenario. It's easier to see this if we rename the condition:

Thread #1

if (my_mutex.is_held) o += 2; // o is an ordinary variable, not atomic or volatile
return o;

Thread #2

{
    scoped_lock l(my_mutex);
    return o;
}

There's no data race here. If Thread #1 has the mutex locked, the write and read can't occur unordered. If it doesn't have the mutex locked, the threads run unordered but both are performing only reads.

Therefore the compiler cannot allow intermediate values to be seen. This C++ code is not a correct rewrite:

o += 2;
if (!my_mutex.is_held) o -= 2;

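because the compiler would have invented a data race: in the rewrite, Thread #1 writes o even when it does not hold the mutex, and that write is unordered with the read in Thread #2. However, if the hardware platform provides a mechanism for race-free speculative writes (Itanium perhaps?), the compiler can use it. So hardware might see intermediate values, even though C++ code cannot.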

If intermediate values shouldn't be seen by hardware, you need to use volatile (possibly in addition to atomics, because volatile read-modify-write is not guaranteed atomic). With volatile, asking for an operation which can't be performed as-written will result in compilation failure, not spurious memory access.