Search code examples
cassemblyatomicstdatomic

`atomic_compare_exchange_strong_explicit()` -- what do the various combinations of `success` and `failure` parameter do, when not equal?


The atomic_compare_exchange_strong_explicit() function takes two memory_order parameters, success and failure (as does atomic_compare_exchange_weak_explicit()). Unpicking the C11/C18 Standards, I find that the allowed values for success and failure are:

    success = memory_order_relaxed     failure = memory_order_relaxed

    success =             _release     failure =             _relaxed

    success =             _consume     failure =             _relaxed
                                             or              _consume

    success =             _acquire     failure =             _relaxed
          or              _acq_rel           or              _consume (?)
                                             or              _acquire

    success =             _seq_cst     failure =             _relaxed
                                             or              _consume (?)
                                             or              _acquire
                                             or              _seq_cst

The Standard also says:

Further, if the comparison is true, memory is affected according to the value of success, and if the comparison is false, memory is affected according to the value of failure. These operations are atomic read-modify-write operations (5.1.2.4).

Your ARM, POWER-PC and other "LL/SC" devices do a Load-Link/Cmp/Store-Conditional sequence to implement atomic-cmp-exchange, where the Load-Link may or may not be _acquire and the Store-Conditional may or may not be _release.

So I can understand: success = _acq_rel and failure = _acquire.

What I cannot get my head around is (amongst others): success = _acq_rel and failure = _relaxed. Surely, in order to achieve _acq_rel, the Load-Link must be _acquire ? If the cmp fails, surely it is then too late to downgrade to _relaxed ?

What is the intended meaning of the combinations of success and failure parameters (when they are not equal) ?

[It's always possible that I have been bamboozled by the Standard, and that the failure memory-order may, in fact, only be the read half of whatever the success memory-order is.]


Solution

  • One way of implementing an acquire load in asm on some ISAs is a plain load followed by a fence, e.g. on ISAs like PowerPC or ARM before ARMv8 introduced ldar / ldaxr. This later fence can be skipped if the failure ordering doesn't include acquire.

    An LL/SC CAS_weak might look something like this in pseudo-asm for no real ISA:

       ll    r0, mem
       cmp   r0, r1
       jne  .fail        # early-out compare fail
       sc    r2, mem     # let's pretend this sets CF condition code on SC failure
       jc   .fail        # jump if SC failed
    
       lwsync              # LoadLoad, StoreStore, and LoadStore but not StoreLoad
       ... CAS success path
    
    .fail:   # we jump here without having executed any barriers
       ... CAS failure path
    

    This (I think) could be a valid implementation of mem.compare_exchange_weak(r1, r2, mo_acquire, mo_relaxed); on some kinds of machines.

    It's only an acquire operation so the whole RMW can reorder with earlier operations (the nature of LL/SC keeps them stuck together in the global order). There's no barrier ahead of the operation, only after.

    lwsync is a PowerPC barrier instruction which blocks all reordering except StoreLoad (i.e. doesn't flush the store buffer). https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/


    To implement CAS(..., acq_rel, relaxed) on most ISAs, we'd also run a barrier before the LL/SC (to separate it from earlier operations, creating the release part). That would execute even on the failure path, but would not create acquire semantics. It wouldn't separate the load from later operations.

    AFAIK you wouldn't want to put the fence between LL and SC to maybe skip if the compare failed. That would lengthen the transaction and give more chance for it to fail from activity of other threads.

    As always when implementing the C++ memory model on top of real asm, you do something that's as strong as necessary but no stronger, given the limits of what primitives the underlying ISA provides. Most ISAs won't make it possible for the CAS(acq_rel, relaxed) failure path to actually be as cheap as a plain relaxed load, either because it's impossible or because it would hurt the normal-case performance. But on some it can still be less expensive than if the failure side needed to have acquire semantics.

    Some ISAs (like ARM) apparently only have full barriers (dsb ish) so even an acq_rel ends up draining the store buffer. (So ARMv8 introducing acquire-load and sequential-release stores was very nice, since it perfectly matches C++ seq-cst semantics and can potentially be much cheaper than a barrier.)