C++: Which weak atomic to use for buffers that receive async. RDMA transfers?

The Derecho system (open-source C++ library for data replication, distributed coordination, Paxos -- ultra-fast) is built around asynchronous RDMA networking primitives. Senders can write to receivers without pausing, using RDMA transfers into receiver memory. Typically this is done in two steps: we transfer the data bytes in one operation, then notify the receiver by incrementing a counter or setting a flag: "message 67 is ready for you, now". Soon the receiver will notice that message 67 is ready, at which point it will access the bytes of that message.

Intended semantic: "seeing the counter updated should imply that the receiver's C++ code will see the bytes of the message." In PL terms, we need a memory fence between the update of the guard and the bytes of the message. The individual cache-lines must also be sequentially consistent: my guard will go through values like 67, 68, .... and I don't want any form of mashed up value or non-monotonic sequencing, such as could arise if C++ reads a stale cache line, or mistakenly holds a stale value in memory. Same for the message buffer itself: these bytes might overwrite old bytes and I don't want to see some kind of mashup.

This is the crux of my question: I need a weak atomic that will impose [exactly] the needed barrier, without introducing unwanted overheads. Which annotation would be appropriate? Would the weak atomic annotation be the same for the "message" as for the counter (the "guard")?

Secondary question: If I declare my buffer with the proper weak atomic, do I also need to say that it is "volatile", or will C++ realize this because the memory was declared weakly atomic?

Solution

An atomic counter, whatever its type, will not guarantee anything about memory not controlled by the CPU. Before the RDMA transfer starts, you need to ensure the CPU's caches for the RDMA region are flushed and invalidated, and then of course not read from or write to that region while the RDMA transfer is ongoing. When the RDMA device signals the transfer is done, then you can update the counter.

The thread that is waiting for the counter to be incremented should not reorder any loads or stores done after reading the counter, so the correct memory order is std::memory_order_acquire. So basically, you want Release-Acquire ordering, although there is nothing to "release" in the thread that updates the counter.

You don't need to make the buffers volatile; in general you should not rely on volatile for atomicity.