c++ · concurrency · language-lawyer · cpu-architecture · lock-free

Can a load or store be reordered before a conditional?


std::atomic_uint64_t writing_ {0};   // index of the next slot the producer will write
std::atomic_uint64_t reading_ {0};   // index of the next slot the consumer will read
std::array<type, size> storage_ {};  // size must be a power of two for the index mask below

bool try_enqueue(type t) noexcept
{
    const std::uint64_t writing {
        writing_.load(std::memory_order::memory_order_relaxed)};
    const auto last_read {reading_.load(std::memory_order::memory_order_relaxed)};
    if (writing - last_read < size) {  // queue not full
        storage_.at(writing & (size - 1)) = t;
        writing_.store(writing + 1, std::memory_order::memory_order_release);

        return true;
    }
    else
        return false;
}

In the above code, as I understand it, if the condition evaluates to false, it is impossible for any thread to observe a write to the shared storage. Is it correct that an operation cannot be perceived as having occurred before a conditional it is sequenced after? Or am I completely misreading this, and could such a thing actually occur (perhaps via speculative execution)?

A little more specifically, could the processor speculatively execute the write (when the condition will eventually evaluate to false), another thread observe the write as having occurred, and then the first thread discard the speculative write?

(note: this is single-producer single-consumer)


Solution

  • Is it correct that an operation cannot be perceived as having occurred before a conditional it is sequenced after?

    C++ compilers definitely aren't allowed to invent writes to atomic (or volatile) objects.

    Compilers aren't even allowed to invent writes to non-atomic objects (e.g. turning a conditional write into read + cmov + unconditional store-back). C++11 introduced a memory model that makes it well-defined for two threads to run code like this at the same time, as long as at most one of them actually writes a given object. An invented store-back of the old value is effectively a non-atomic RMW that could step on the other thread's write, so the emitted asm wouldn't behave "as-if" the C++ abstract machine were running the source code; that makes it a non-option for the compiler.
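    As a minimal sketch (with hypothetical names, not taken from the question), this is the kind of program the memory model makes well-defined, and which an invented store on the false path would break:

    #include <thread>

    int x = 0;  // plain non-atomic object

    void maybe_write(bool cond, int value)
    {
        if (cond)
            x = value;   // the only path on which the abstract machine writes x
        // An invented unconditional store-back ("x = x;") on the false path
        // would race with the writer thread below, a race the abstract
        // machine does not have.
    }

    int main()
    {
        std::thread a(maybe_write, false, 1);  // never writes x per the source
        std::thread b([] { x = 2; });          // the only real writer
        a.join();
        b.join();
    }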

    But if the compiler can see that an object is written unconditionally (on every path through the function), it can pretty much do whatever it wants, because a legal program can't observe the difference: observing it from another thread would involve data-race UB.
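    For instance (a hypothetical function, not from the question), when an object is stored to on both branches, the compiler is free to compute the value branchlessly and emit a single unconditional store:

    int g;  // plain non-atomic global

    void set(bool cond, int a, int b)
    {
        if (cond)
            g = a;
        else
            g = b;
        // g is written on every path, so the compiler may emit the
        // equivalent of "g = cond ? a : b;" (cmov + one store).
        // Another thread reading g concurrently already has data-race UB,
        // so no legal program can tell the difference.
    }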


  • A little more specifically, could the processor speculatively execute the write (when the condition will eventually evaluate to false), another thread observe the write as having occurred, and then the first thread discard the speculative write?

    No, speculation doesn't escape the core doing the speculation. Otherwise when mis-speculation is detected, all cores would have to roll back their state!

    This is one of the primary reasons store buffers exist: to decouple out-of-order speculative execution of stores from commit to L1d cache (which is when a store becomes globally visible to other cores). It also decouples execution from cache-miss stores, which is useful even on in-order, non-speculative CPUs.

    Stores don't commit to L1d until after the store instruction has retired from the out-of-order core (i.e. is known to be non-speculative). Retired stores that haven't committed yet are sometimes called "graduated" to distinguish them from other store-buffer entries that could potentially be discarded if the core needs to roll back to retirement state.

    This allows hardware speculative execution without inventing writes.

    (see also Can a speculatively executed CPU branch contain opcodes that access RAM? for more details. Fun fact: some CPUs, specifically PowerPC, can do store-forwarding of graduated stores between SMT threads on the same physical core, making stores visible to some cores before they become globally visible. But only for graduated stores, otherwise possible mis-speculation could leak out.)


    In C++, a std::memory_order_release store forces the compiler to use sufficient barriers or release-store instructions (e.g. a normal mov on x86 is already a release store; stlr on AArch64 is a sequential-release store), or whatever other mechanism makes the asm guarantee runtime ordering at least as strong as the C++ abstract machine guarantees.

    The C++ standard is defined in terms of sequenced-before / happens-before relations, not barriers, but on any given platform the implementation / ABI standardizes on a mapping from std::atomic operations to asm sequences (e.g. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).
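    To make the release pairing concrete, here is a sketch of a matching consumer (using the same member names and placeholders as the question; the question doesn't show try_dequeue, so this is an assumption about its shape). The acquire load of writing_ synchronizes-with the release store in try_enqueue, so the element written to storage_ is visible before the consumer reads it:

    bool try_dequeue(type& out) noexcept
    {
        const std::uint64_t last_read {reading_.load(std::memory_order_relaxed)};
        // Acquire pairs with the release store of writing_ in try_enqueue:
        // if this load sees the incremented index, it also sees the element.
        const std::uint64_t writing {writing_.load(std::memory_order_acquire)};
        if (writing != last_read) {                      // queue not empty
            out = storage_.at(last_read & (size - 1));
            reading_.store(last_read + 1, std::memory_order_release);
            return true;
        }
        return false;
    }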