Tags: c++, multithreading, performance, atomic, stdatomic

Is there any performance difference in just reading an atomic variable compared to a normal variable?


int i = 0;
if(i == 10)  {...}  // [1]

std::atomic<int> ai{0};
if(ai == 10) {...}  // [2]
if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

Is statement [1] any faster than statements [2] & [3] in a multithreaded environment?
Assume that ai may or may not be written by another thread while [2] & [3] are executing.

Add-on: Provided that an accurate value of the underlying integer is not a necessity, what is the fastest way to read an atomic variable?


Solution

  • It depends on the architecture, but in general loads are cheap; a store paired with a strict memory ordering, however, can be expensive.

    On x86_64, naturally aligned loads and stores of up to 64 bits are atomic on their own (but read-modify-write operations are decidedly not).
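
    For illustration, here is a minimal sketch (mine, not from the original answer) of that distinction: the plain load and store each compile to a single mov on x86_64 (the store additionally pays for a barrier under the default seq_cst ordering), while the read-modify-write has to use a locked instruction such as lock xadd:

    #include <atomic>

    std::atomic<int> counter{0};

    int  read_it()       { return counter.load(); }   // a single mov on x86_64
    void write_it(int v) { counter.store(v); }        // store + full barrier under default seq_cst
    void bump_it()       { counter.fetch_add(1); }    // locked read-modify-write (e.g. lock xadd)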

    As you have it, the default memory ordering in C++ is std::memory_order_seq_cst, which gives you sequential consistency, i.e. there is a single total order in which all threads observe the loads/stores occurring. To accomplish this on x86 (and indeed on all multi-core systems) requires a memory fence on stores, so that loads occurring after the store are not satisfied until the store has become visible.

    Reading in this case does not require a memory fence on strongly-ordered x86, but writing does. On most weakly-ordered ISAs, even seq_cst reading would require some barrier instructions, but not a full barrier. If we look at this code:

    #include <atomic>
    #include <stdlib.h>
    
    int main(int argc, const char* argv[]) {
        std::atomic<int> num;
    
        num = 12;
        if (num == 10) {
            return 0;
        }
        return 1;
    }
    

    compiled with -O3:

       0x0000000000000560 <+0>:     sub    $0x18,%rsp
       0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
       0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
       0x0000000000000572 <+18>:    xor    %eax,%eax
       0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
       0x000000000000057c <+28>:    mfence 
       0x000000000000057f <+31>:    mov    0x4(%rsp),%eax
       0x0000000000000583 <+35>:    cmp    $0xa,%eax
       0x0000000000000586 <+38>:    setne  %al
       0x0000000000000589 <+41>:    mov    0x8(%rsp),%rdx
       0x000000000000058e <+46>:    xor    %fs:0x28,%rdx
       0x0000000000000597 <+55>:    jne    0x5a1 <main+65>
       0x0000000000000599 <+57>:    movzbl %al,%eax
       0x000000000000059c <+60>:    add    $0x18,%rsp
       0x00000000000005a0 <+64>:    retq
    

    We can see that the read from the atomic variable at +31 doesn't require anything special, but because we wrote to the atomic at +20, the compiler had to insert an mfence instruction afterwards which ensures that this thread waits for its store to become visible before doing any later loads. This is expensive, stalling this core until the store buffer drains. (Out-of-order exec of later non-memory instructions is still possible on some x86 CPUs.)
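
    As an aside, a minimal sketch (mine, not part of the original answer) of the classic store/load pattern shows what that fence buys: under seq_cst at least one thread must observe the other's store, so r1 and r2 cannot both end up 0. Without the full barrier after each store, a core could satisfy its load before its own store became globally visible, and both could read 0.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread_a() {
        x.store(1);     // seq_cst store: mov + mfence, as in the listing above
        r1 = y.load();  // seq_cst load: plain mov
    }

    void thread_b() {
        y.store(1);
        r2 = x.load();
    }

    int main() {
        std::thread a(thread_a), b(thread_b);
        a.join(); b.join();
        assert(r1 == 1 || r2 == 1);  // (0, 0) is forbidden under seq_cst
        return 0;
    }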

    If we instead use a weaker ordering (such as std::memory_order_release) on the write:

    #include <atomic>
    #include <stdlib.h>
    
    int main(int argc, const char* argv[]) {
        std::atomic<int> num;
    
        num.store(12, std::memory_order_release);
        if (num == 10) {
            return 0;
        }
        return 1;
    }
    

    Then on x86 we don't need the fence:

       0x0000000000000560 <+0>:     sub    $0x18,%rsp
       0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
       0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
       0x0000000000000572 <+18>:    xor    %eax,%eax
       0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
       0x000000000000057c <+28>:    mov    0x4(%rsp),%eax
       0x0000000000000580 <+32>:    cmp    $0xa,%eax
       0x0000000000000583 <+35>:    setne  %al
       0x0000000000000586 <+38>:    mov    0x8(%rsp),%rdx
       0x000000000000058b <+43>:    xor    %fs:0x28,%rdx
       0x0000000000000594 <+52>:    jne    0x59e <main+62>
       0x0000000000000596 <+54>:    movzbl %al,%eax
       0x0000000000000599 <+57>:    add    $0x18,%rsp
       0x000000000000059d <+61>:    retq   
    

    Note though, if we compile this same code for AArch64:

       0x0000000000400530 <+0>:     stp  x29, x30, [sp,#-32]!
       0x0000000000400534 <+4>:     adrp x0, 0x411000
       0x0000000000400538 <+8>:     add  x0, x0, #0x30
       0x000000000040053c <+12>:    mov  x2, #0xc
       0x0000000000400540 <+16>:    mov  x29, sp
       0x0000000000400544 <+20>:    ldr  x1, [x0]
       0x0000000000400548 <+24>:    str  x1, [x29,#24]
       0x000000000040054c <+28>:    mov  x1, #0x0
       0x0000000000400550 <+32>:    add  x1, x29, #0x10
       0x0000000000400554 <+36>:    stlr x2, [x1]
       0x0000000000400558 <+40>:    ldar x2, [x1]
       0x000000000040055c <+44>:    ldr  x3, [x29,#24]
       0x0000000000400560 <+48>:    ldr  x1, [x0]
       0x0000000000400564 <+52>:    eor  x1, x3, x1
       0x0000000000400568 <+56>:    cbnz x1, 0x40057c <main+76>
       0x000000000040056c <+60>:    cmp  x2, #0xa
       0x0000000000400570 <+64>:    cset w0, ne
       0x0000000000400574 <+68>:    ldp  x29, x30, [sp],#32
       0x0000000000400578 <+72>:    ret
    

    When we write to the variable at +36, we use a Store-Release instruction (stlr), and loading at +40 uses a Load-Acquire (ldar). These each provide a partial memory fence (and together form a full fence).
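
    For context, here is a sketch (my example, not from the original answer) of the usual release/acquire pairing that those instructions implement: the writer publishes data with a release store, the reader spins on an acquire load, and everything written before the release is guaranteed to be visible after the acquire succeeds.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // stlr on AArch64
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // ldar on AArch64
            ;                                           // spin until published
        assert(payload == 42);                          // ordered by release/acquire
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
        return 0;
    }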

    You should only use an atomic when you have to reason about the ordering of accesses to the variable. To answer your add-on question: read the atomic with std::memory_order_relaxed. That is the fastest option, but it gives no guarantee of synchronizing with writes from other threads; only the atomicity of the read itself is guaranteed.
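
    For completeness, a minimal sketch (mine) of that add-on answer: a relaxed load compiles to a plain mov on x86 and an ordinary ldr on AArch64, so it is the cheapest way to read the atomic when all you need is a tearing-free value, e.g. polling a stop flag whose exact timing doesn't matter.

    #include <atomic>
    #include <thread>

    std::atomic<bool> stop_requested{false};

    void worker() {
        // Tearing-free read with no ordering guarantees: we may notice the
        // flag slightly "late", which is acceptable for a stop flag.
        while (!stop_requested.load(std::memory_order_relaxed)) {
            // ... hypothetical unit of work ...
        }
    }

    int main() {
        std::thread t(worker);
        stop_requested.store(true, std::memory_order_relaxed);
        t.join();
        return 0;
    }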