multithreading concurrency arm atomic arm64

ARM STLR memory ordering semantics

I'm struggling with the exact semantics of the ARM STLR.

According to the documentation, it has release semantics. So for an STLR store you would get:

[StoreStore][LoadStore]
X=r1

Here X is a memory location and r1 is some register.

The problem is that a release store followed by an acquire load fails to provide sequential consistency:

[StoreStore][LoadStore]
X=r1
r2=Y
[LoadLoad][LoadStore]

In the above case, X=r1 and r2=Y are allowed to be reordered. To make this sequentially consistent, a [StoreLoad] needs to be added:

[StoreStore][LoadStore]
X=r1
[StoreLoad]
r2=Y
[LoadLoad][LoadStore]

You normally attach this fence to the store rather than the load, because loads are more frequent.
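To make the reordering concrete, here is a minimal sketch of the classic store-buffering litmus test in C++. The helper name count_both_zero is made up for illustration. With memory_order_seq_cst on every access, the outcome r1 == 0 && r2 == 0 is forbidden by the C++ memory model; if you weaken the accesses to release/acquire, that outcome becomes allowed, which is exactly the missing [StoreLoad]. Spawning fresh threads per iteration is not a good way to provoke the race, so treat this as an illustration rather than a stress test.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `iterations`
// times and returns how often both threads read 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        // Thread 1: X = 1; r1 = Y
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        // Thread 2: Y = 1; r2 = X
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();

        // Both loads seeing 0 means each store was reordered after the
        // load that follows it in program order.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With seq_cst everywhere, count_both_zero should always return 0.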

On x86, plain stores are release stores and plain loads are acquire loads. The [StoreLoad] can be implemented with an MFENCE, or with a LOCK-prefixed no-op such as LOCK ADDL $0, (%RSP), as is done in the HotSpot JVM.

Looking at the ARM documentation, it seems that an LDAR has acquire semantics, so that would be [LoadLoad][LoadStore].

But the semantics of STLR are vague. When I compile a C++ atomic store using memory_order_seq_cst, there is just an STLR; there is no DMB. So it seems that STLR has much stronger memory ordering guarantees than a release store. To me it seems that, at the fence level, an STLR is equivalent to:

 [StoreStore][LoadStore]
 X=r1
 [StoreLoad]

Could someone shed some light on this?

Solution

I'm just learning about this stuff, so take this with a grain of salt. But my understanding is that in ARMv8/AArch64, STLR/LDAR do provide additional semantics beyond the usual definitions of release/acquire, though not as strong as your suggestion. Namely, a release store STLR cannot be reordered with an acquire load LDAR that follows it in program order, which gives sequential consistency for that pair, but it can still be reordered with ordinary LDR loads.

From the ARMv8 Architecture Reference Manual, B2.3.7, "Load-Acquire, Load-AcquirePC, and Store-Release":

Where a Load-Acquire appears in program order after a Store-Release, the memory access generated by the Store-Release instruction is Observed-by each PE to the extent that PE is required to observe the access coherently, before the memory access generated by the Load-Acquire instruction is Observed-by that PE, to the extent that the PE is required to observe the access coherently.

And from B2.3.2, "Ordering relations":

A read or a write RW1 is Barrier-ordered-before a read or a write RW2 from the same Observer if and only if RW1 appears in program order before RW2 and any of the following cases apply: [...] RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2 generated by an instruction with Acquire semantics.
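In C++ terms, my understanding is that this rule is what lets a compiler implement seq_cst accesses on AArch64 without a DMB. A rough sketch follows; the function names are made up, and the instruction named in each comment is what I would expect GCC and Clang to emit for AArch64, so check the actual assembly for your compiler and flags.

```cpp
#include <atomic>

std::atomic<int> X{0};
std::atomic<int> Y{0};

// Expected to compile to a bare STLR on AArch64 (assumption about codegen).
void store_x_seq_cst(int v) {
    X.store(v, std::memory_order_seq_cst);
}

// Expected to compile to LDAR. An LDAR after an STLR in program order
// cannot be observed before it, per the rule quoted above.
int load_y_seq_cst() {
    return Y.load(std::memory_order_seq_cst);
}

// Expected to compile to a plain LDR. This load is NOT ordered after an
// earlier STLR, so it may be satisfied before the store becomes visible.
int load_y_relaxed() {
    return Y.load(std::memory_order_relaxed);
}
```

So store_x_seq_cst followed by load_y_seq_cst gets the StoreLoad ordering for free, while store_x_seq_cst followed by load_y_relaxed does not.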

As a test, I borrowed a C++ implementation of Peterson's locking algorithm by LWimsey. With clang 11.0 on godbolt, you can see that even when sequential consistency is requested, the compiler still generates STLR followed by LDAR to take the lock (lines 18-19 of the assembly), with no DMB. I ran it for a while (Raspberry Pi 4B, Cortex-A72, 4 cores) and got no violations.

However, contrary to your idea, STLR can still be reordered with respect to ordinary (non-acquire) loads that follow it, so it does not implicitly have a full StoreLoad fence. I modified LWimsey's program to use STLR, LDR instead, and after adding some extra garbage to provoke the race, I was able to see lock violations.

Likewise, LDAR can be reordered with respect to ordinary (non-release) stores that precede it. I was similarly able to get lock violations with STR, LDAR in the test program.
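For reference, here is a minimal sketch of a Peterson lock of the kind described above. This is not LWimsey's program; the names PetersonLock and run_counter_test are my own, and the yield in the spin loop is only there to keep the test from crawling on machines with few cores. The store to turn followed by the load of flag[other] is the store/load pair that needs StoreLoad ordering, and with seq_cst I would expect it to come out as STLR followed by LDAR on AArch64.

```cpp
#include <atomic>
#include <thread>

// Two-thread Peterson lock using seq_cst atomics throughout.
struct PetersonLock {
    std::atomic<bool> flag[2];
    std::atomic<int> turn;

    PetersonLock() : turn(0) {
        flag[0].store(false);
        flag[1].store(false);
    }

    void lock(int me) {
        const int other = 1 - me;
        flag[me].store(true, std::memory_order_seq_cst);
        turn.store(other, std::memory_order_seq_cst);  // store ...
        // ... followed by loads that must not be reordered before it.
        while (flag[other].load(std::memory_order_seq_cst) &&
               turn.load(std::memory_order_seq_cst) == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int me) {
        flag[me].store(false, std::memory_order_seq_cst);
    }
};

// Hypothetical test driver: two threads increment a plain (non-atomic)
// counter under the lock. A lost update would indicate a lock violation.
long run_counter_test(int iterations_per_thread) {
    PetersonLock mutex;
    long counter = 0;
    auto worker = [&](int me) {
        for (int i = 0; i < iterations_per_thread; ++i) {
            mutex.lock(me);
            ++counter;
            mutex.unlock(me);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

With seq_cst on every access, run_counter_test(n) should return 2 * n. The STLR, LDR and STR, LDAR variants I tested correspond roughly to weakening the loads in lock() to relaxed, or the stores to relaxed, at which point mutual exclusion is no longer guaranteed.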