Tags: multithreading, rust, atomic, memory-barriers

How does this acquire-release relationship work?


In Rust Atomics and Locks, more or less the following code is suggested for correctly implementing the Drop trait of a simplified Arc (the code below is my own):

        unsafe {
            if 1 == (*self.inner).strong.fetch_sub(1, Release) {
                fence(Acquire);
                let box_ = Box::from_raw(self.inner);
                drop(box_);
            }
        }

where (*self.inner).strong is an AtomicUsize.
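To make the snippet concrete, here is a self-contained sketch of the surrounding type it might live in. The `MiniArc` name and the `new`/`Clone`/`Deref` impls are my own filler to make it compile; only the `Drop` body follows the book's pattern.

```rust
use std::ops::Deref;
use std::sync::atomic::{
    fence, AtomicUsize,
    Ordering::{Acquire, Relaxed, Release},
};

struct ArcInner<T> {
    strong: AtomicUsize,
    data: T,
}

pub struct MiniArc<T> {
    inner: *mut ArcInner<T>,
}

// Safety: same bounds as std's Arc.
unsafe impl<T: Send + Sync> Send for MiniArc<T> {}
unsafe impl<T: Send + Sync> Sync for MiniArc<T> {}

impl<T> MiniArc<T> {
    pub fn new(data: T) -> Self {
        MiniArc {
            inner: Box::into_raw(Box::new(ArcInner {
                strong: AtomicUsize::new(1),
                data,
            })),
        }
    }
}

impl<T> Clone for MiniArc<T> {
    fn clone(&self) -> Self {
        // Relaxed suffices for the increment: cloning publishes no new data.
        unsafe { (*self.inner).strong.fetch_add(1, Relaxed) };
        MiniArc { inner: self.inner }
    }
}

impl<T> Deref for MiniArc<T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &(*self.inner).data }
    }
}

impl<T> Drop for MiniArc<T> {
    fn drop(&mut self) {
        unsafe {
            // Release on every decrement; Acquire fence only on the last one.
            if 1 == (*self.inner).strong.fetch_sub(1, Release) {
                fence(Acquire);
                drop(Box::from_raw(self.inner));
            }
        }
    }
}

fn main() {
    let a = MiniArc::new(String::from("hello"));
    let b = a.clone();
    drop(a);
    assert_eq!(&*b, "hello"); // still alive: the count was 2
    drop(b); // count hits 0; the box is freed exactly once
}
```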

So here is my question:

I understand that an acquire will establish a "happens-before/synchronizes-with" relationship with a previous release (with the thread where the loaded value was written). In order to properly drop the value, we need each Arc's fetch_sub to be linked one after the other.

I don't understand how this is guaranteed here. I would understand if we acquired the atomic's value, subtracted one, and then released, all in a single operation (essentially fetch_sub with AcqRel ordering, or a compare-and-swap loop).

Here, we only "acquire" when we load 1. My understanding of using Release on fetch_sub is that the fetch part is relaxed. Based on this understanding, I have two issues with the aforementioned code:

1

Wouldn't this case be possible:

Thread A:
  load: 2 (t = 1)
  write 1 (t = ?)
Thread B:
  load: 2 (t = 1)
  write 1 (t = ?)

and thus the value is never dropped? I fail to see what guarantees the load exclusive access to the value, i.e. that the loads always happen in order.

2

Am I not establishing only a "happens-before" relationship between the reading of 1 and the drop? I think this is fine, since I can only drop if I indeed see a 1, but how am I guaranteed to see that 1 and not have a data race on the atomic?

I suspect my misunderstanding comes from the guarantees provided by atomics/ordering, but I can't pinpoint exactly where my reasoning goes wrong.


Solution

  • Atomic read-modify-write operations are always atomic, regardless of the memory ordering chosen. So if, let's say, (*self.inner).strong has the value 5, and then you do fetch_sub(1, Relaxed) five times across any combination of threads, you are still guaranteed that exactly one of them will load the value 2 and store the value 1.

    Rust doesn't have a formally defined memory model, but AFAIK in practice it follows C++, which specifies this behavior by saying that "an atomic read-modify-write always loads the value that, in the modification order of the object, immediately precedes the value it stored." https://timsong-cpp.github.io/cppwp/n4868/atomics.order#10
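This atomicity is observable directly: even with Relaxed ordering, concurrent fetch_subs each see a distinct value from the modification order. A small demo (the initial count of 5 and the thread count are arbitrary choices of mine):

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::thread;

fn main() {
    let counter = AtomicUsize::new(5);
    let mut loaded: Vec<usize> = thread::scope(|s| {
        // Spawn all threads first so the decrements actually race.
        let handles: Vec<_> = (0..5)
            .map(|_| s.spawn(|| counter.fetch_sub(1, Relaxed)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    loaded.sort();
    // Every thread observed a distinct value: no two threads can both
    // "load 2 and store 1", even under Relaxed ordering.
    assert_eq!(loaded, vec![1, 2, 3, 4, 5]);
}
```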

    You do have a happens-before relationship between the load of 1 and the drop, but that is trivial; for operations within a single thread, happens-before always follows program order. Confusingly, this does not mean that they cannot be reordered for execution!

    What you really care about is that all prior accesses to box_, from other threads, must happen-before the drop. That's the condition that avoids a data race. And this you do get, thanks to the use of acquire and release ordering.

    It would be a little simpler to see why if the fetch_sub used AcqRel ordering. In that case, we would have:

    • the last prior access happens-before the store of the value 1 (because it precedes that fetch_sub in program order within a single thread)

    • the store of 1 synchronizes with the load of 1, because the former is release, the latter is acquire, and the value observed by the load was the value put there by the store

    • the load of 1 happens-before the drop (program order)

    • happens-before is transitive.

    You can argue similarly for all other accesses to the box.
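That AcqRel argument can be exercised with a bare, self-contained refcount. The `Payload` type and `DROPS` counter below are illustrative names of mine; the point is that the final AcqRel decrement happens-after every earlier one, so exactly one thread frees the allocation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering::{AcqRel, Relaxed}};
use std::thread;

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Payload;
impl Drop for Payload {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Relaxed);
    }
}

fn main() {
    // Refcount of 4, decremented once per thread with AcqRel.
    let inner = Box::into_raw(Box::new((AtomicUsize::new(4), Payload)));
    let addr = inner as usize; // raw pointers aren't Send; smuggle as usize
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(move || unsafe {
                let inner = addr as *mut (AtomicUsize, Payload);
                if (*inner).0.fetch_sub(1, AcqRel) == 1 {
                    // The final decrement is acquire as well as release, so it
                    // happens-after every earlier decrement: freeing is safe.
                    drop(Box::from_raw(inner));
                }
            });
        }
    });
    assert_eq!(DROPS.load(Relaxed), 1); // the payload was freed exactly once
}
```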

    But using AcqRel is somewhat inefficient, because we are paying for acquire ordering on every fetch_sub, but we really only need it for the last one (the one which loads the value 1). So in this code, we instead use Release ordering only on the fetch_sub, together with an Acquire fence which is only executed conditionally when the value loaded equals 1. The fence can be thought of as "retroactively" promoting the relaxed load of the fetch_sub to acquire. The formal proof takes a little more work, but it all does work out.
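A standalone demo of this Release-plus-fence pattern (again, `Payload` and `DROPS` are illustrative names of mine) checks that the payload is still freed exactly once, with the Acquire fence executed only on the winning thread's path:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering::{Acquire, Relaxed, Release}};
use std::thread;

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Payload;
impl Drop for Payload {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Relaxed);
    }
}

fn main() {
    // Refcount of 8, decremented once per thread with Release only.
    let inner = Box::into_raw(Box::new((AtomicUsize::new(8), Payload)));
    let addr = inner as usize; // raw pointers aren't Send; smuggle as usize
    thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(move || unsafe {
                let inner = addr as *mut (AtomicUsize, Payload);
                if (*inner).0.fetch_sub(1, Release) == 1 {
                    // Only the thread that saw 1 pays for acquire ordering:
                    // the fence "promotes" its relaxed load retroactively.
                    fence(Acquire);
                    drop(Box::from_raw(inner));
                }
            });
        }
    });
    assert_eq!(DROPS.load(Relaxed), 1); // the payload was freed exactly once
}
```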

    There's another side question: with Release ordering only, you no longer have a full happens-before chain involving all the fetch_subs, so how do you guarantee that the drop happens-after the second-to-last access? The C++ memory model handles this with the concept of a "release sequence". Normally, a load can only synchronize with a store if it loads the value put there by the store. But release sequences mean it also "counts" if you load a value that was produced by some sequence of atomic read-modify-writes acting on the value originally stored. So your drop not only happens-after the store of the value 1, it also happens-after a hypothetical store of the value 47 (since we could only have reached 1 via some long chain of atomic fetch_add and fetch_sub).