Search code examples
c++armx86-64atomicstdatomic

How is atomic_flag implemented?


How is atomic_flag is implemented? It feels to me that on x86-64 it is equivalent to atomic_boolanyway, but it is just a guess. Might the x86-64 implementation be any different from arm or x86?


Solution

  • Yeah, on normal CPUs where atomic<bool> and atomic<int> are also lock-free, it's pretty much like atomic<bool>, using the same instructions. (x86 and x86-64 have the same set of atomic operations available.)

    You might think that it would always use x86 lock bts or lock btr to set / reset (clear) a single bit, but it can be more efficient to do other things (especially for a function that returns a bool instead of branching on it). The object is a whole byte so you can just store or exchange the whole byte. (And if the ABI guarantees that the value is always 0 or 1, you don't have to booleanize it before returning the result as a bool)

    GCC and clang compile test_and_set to a byte exchange, and clear to a byte store of 0. We get (nearly) identical asm for atomic_flag test_and_set as f.exchange(true);

    #include <atomic>
    
    bool TAS(std::atomic_flag &f) {
        return f.test_and_set();
    }
    
    bool TAS_bool(std::atomic<bool> &f) {
        return f.exchange(true);
    }
    
    
    void clear(std::atomic_flag &f) {
        //f = 0; // deleted
        f.clear();
    }
    
    void clear_relaxed(std::atomic_flag &f) {
        f.clear(std::memory_order_relaxed);
    }
    
    void bool_clear(std::atomic<bool> &f) {
        f = false; // deleted
    }
    

    On Godbolt for x86-64 with gcc and clang, and for ARMv7 and AArch64.

    ## GCC9.2 -O3 for x86-64
    TAS(std::atomic_flag&):
            mov     eax, 1
            xchg    al, BYTE PTR [rdi]
            ret
    TAS_bool(std::atomic<bool>&):
            mov     eax, 1
            xchg    al, BYTE PTR [rdi]
            test    al, al
            setne   al                      # missed optimization, doesn't need to booleanize to 0/1
            ret
    clear(std::atomic_flag&):
            mov     BYTE PTR [rdi], 0
            mfence                          # memory fence to drain store buffer before future loads
            ret
    clear_relaxed(std::atomic_flag&):
            mov     BYTE PTR [rdi], 0      # x86 stores are already mo_release, no barrier
            ret
    bool_clear(std::atomic<bool>&):
            mov     BYTE PTR [rdi], 0
            mfence
            ret
    

    Note that xchg is also an efficient way to do a seq_cst store on x86-64, usually more efficient than the mov + mfence that gcc uses. Clang uses xchg for all of these (except the relaxed store).

    Amusingly, clang re-booleanizes to 0/1 after the xchg in atomic_flag.test_and_set(), but GCC instead does it after atomic<bool>. clang does a weird and al,1 in TAS_bool, which would treat values like 2 as false. It seems totally pointless; the ABI guarantees that a bool in memory is always stored as a 0 or 1 byte.

    For ARM, we have ldrexb / strexb exchange retry loops, or just strb + dmb ish for the pure store. Or AArch64 can use stlrb wzr, [x0] for clear or assign-false to do a sequential-release store (of the zero-register) without needing a barrier.