Search code examples
c++multithreadingmemory-barriersstdatomicrelaxed-atomics

Are relaxed atomic store reordered themselves before the release? (similar with load /acquire)


I read in the en.cppreference.com specifications relaxed operations on atomics:

"[...]only guarantee atomicity and modification order consistency."

So, I was asking myself if such 'modification order' would work when you are working on the same atomic variable or different ones.

In my code I have an atomic tree, where a low priority, event based message thread fills which node should be updated storing some data on red '1' atomic (see picture), using memory_order_relaxed. Then it continues writing in its parent using fetch_or to know which child atomic has been updated. Each atomic supports up to 64 bits, so I fill the bit 1 in red operation '2'. It continues successively until the root atomic which is also flagged using the fetch_or but using this time memory_order_release.

enter image description here

Then a fast, real time, unblockable, thread loads the control atomic (with memory_order_acquire) and reads which bits have it enabled. Then it updates recursively the childs atomics with memory_order_relaxed. And that is how I sync my data with each cycle of the high priority thread.

Since this thread is updating, it is fine child atomics are being stored before its parent. The problem is when it stores a parent (filling the bit of the children to update) before I fill the child information.

In other words, as the tittle says, are the relaxed stores reordered between them before the release one? I don't mind non-atomic variables are reordered. Pseudo-code, suppose [x, y, z, control] are atomic and with initial values 0:

Event thread:
z = 1; // relaxed
y = 1; // relaxed
x = 1; // relaxed;
control = 0; // release

Real time thread (loop):
load control; // acquire
load x; // relaxed
load y; // relaxed
load z; // relaxed

I wonder if in the real time thread this would be true always: x <= y <=z. To check that I wrote this small program:

#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1
#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>

using namespace std;
constexpr int numTries = 10000;
constexpr int arraySize = 10000;
array<atomic<int>, arraySize> tat;
atomic<int> tsync {0};

void writeArray()
{
    // Stores atomics in reverse order
    for (int j=0; j!=numTries; ++j)
    {
        for (int i=arraySize-1; i>=0; --i)
        {
            tat[i].store(j, memory_order_relaxed);
        }
        tsync.store(0, memory_order_release);
    }
}

void readArray()
{
    // Loads atomics in normal order
    for (int j=0; j!=numTries; ++j)
    {
        bool readFail = false;
        tsync.load(memory_order_acquire);

        int minValue = 0;
        for (int i=0; i!=arraySize; ++i)
        {
            int newValue = tat[i].load(memory_order_relaxed);
            // If it fails, it stops the execution
            if (newValue < minValue)
            {
                readFail = true;
                cout << "fail " << endl;
                break;
            }
            minValue = newValue;
        }

        if (readFail) break;
    }
}


int main()
{
    for (int i=0; i!=arraySize; ++i)
    {
        tat[i].store(0);
    }

    thread b(readArray);
    thread a(writeArray);

    a.join();
    b.join();
}

How it works: There is an array of atomic. One thread stores with relaxed ordering in reverse order and ends storing a control atomic with release ordering.

The other thread loads with acquire ordering that control atomic, then it loads with relaxed that atomic the rest of values of the array. Since the parents mustn't be updates before the children, the newValue should always be equal or greater than the oldValue.

I've executed this program on my computer several times, debug and release, and it doesn't trig the fail. I'm using a normal x64 Intel i7 processor.

So, is it safe to suppose that relaxed stores to multiple atomics do keep the 'modification order' at least when they are being sync with a control atomic and acquire/release?


Solution

  • Sadly, you will learn very little about what the Standard supports by experiment with x86_64, because the x86_64 is so well-behaved. In particular, unless you specify _seq_cst:

    • all reads are effectively _acquire

    • all writes are effectively _release

    unless they cross a cache-line boundary. And:

    • all read-modify-write are effectively seq_cst

    Except that the compiler is (also) allowed to re-order _relaxed operations.

    You mention using _relaxed fetch_or... and if I understand correctly, you may be disappointed to learn that is no less expensive than seq_cst, and requires a LOCK prefixed instruction, carrying the full overhead of that.


    But, yes _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered wrt to other _relaxed atomic operations as well as not-atomic ones -- by the compiler and/or the machine. [Though, as noted, on x86_64, not by the machine.]

    And, yes where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal Y knows it has synchronized and can read what was written by X (up to the release).

    Now, the key thing to understand here is that simply doing a store _release is not enough, the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. For otherwise, how can the load tell ?

    Generally a _release/_acquire pair like this are used to synchronize access to some collection of data. Once that data is "ready", a store _release signals that. Any load _acquire which sees the signal (or all loads _acquire which see the signal) know that the data is "ready" and they can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the load(s) _acquire. What I am trying to say here, is that another signal may be required if there are to be further changes to the data.

    Your little test program:

    1. initialises tsync to 0

    2. in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release)

      so the value of tsync does not change !

    3. in the reader: does tsync.load(memory_order_acquire) before doing tat[i].load(memory_order_relaxed)

      and ignores the value read from tsync

    I am here to tell you that the _release/_acquire pairs are not synchronizing -- all these stores/load may as well be _relaxed. [I think your test will "pass" if the the writer manages to stay ahead of the reader. Because on x86-64 all writes are done in instruction order, as are all reads.]

    For this to be a test of _release/_acquire semantics, I suggest:

    1. initialises tsync to 0 and tat[] to all zero.

    2. in the writer: run j = 1..numTries

      after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release)

      this signals that the pass is complete, and that all tat[] is now j.

    3. in the reader: do j = tsync.load(memory_order_acquire)

      a pass across tat[] should find j <= tat[i].load(memory_order_relaxed)

      and after the pass, j == numTries signals that the writer has finished.

    where the signal sent by the writer is that it has just completed writing j, and will continue with j+1, unless j == numTries. But this does not guarantee the order in which tat[] are written.

    If what you wanted was for the writer to stop after each pass, and wait for the reader to see it and signal same -- then you need another signal and you need the threads to wait for their respective "you may proceed" signal.