Tags: c++, multithreading, performance, atomic, stdatomic

Is there any performance difference in just reading an atomic variable compared to a normal variable?


int i = 0;
if(i == 10)  {...}  // [1]

std::atomic<int> ai{0};
if(ai == 10) {...}  // [2]
if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

Is statement [1] any faster than statements [2] & [3] in a multithreaded environment?
Assume that ai may or may not be written by another thread while [2] & [3] are executing.

Add-on: Provided that an accurate value of the underlying integer is not a necessity, what is the fastest way to read an atomic variable?


Solution

  • It depends on the architecture, but in general loads are cheap; a store paired with a strict memory ordering, however, can be expensive.

    On x86_64, naturally aligned loads and stores of up to 64 bits are atomic on their own (but read-modify-write operations are decidedly not).
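
    For illustration, here is a minimal sketch (mine, not from the original answer) of that distinction: the plain load and store each compile to a single mov on x86_64 (the store additionally pays for a barrier under the default seq_cst ordering), while the read-modify-write has to use a locked instruction such as lock xadd:

    #include <atomic>

    std::atomic<int> counter{0};

    int  read_it()       { return counter.load(); }   // a single mov on x86_64
    void write_it(int v) { counter.store(v); }        // store + full barrier under default seq_cst
    void bump_it()       { counter.fetch_add(1); }    // locked read-modify-write (e.g. lock xadd)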

    As you have it, the default memory ordering in C++ is std::memory_order_seq_cst, which gives you sequential consistency, i.e. there is a single total order in which all threads observe the loads/stores occurring. To accomplish this on x86 (and indeed on all multi-core systems) requires a memory fence on stores, so that loads occurring after the store are not satisfied until the store has become visible.

    Reading in this case does not require a memory fence on strongly-ordered x86, but writing does. On most weakly-ordered ISAs, even seq_cst reading would require some barrier instructions, but not a full barrier. If we look at this code:

    #include <atomic>
    #include <stdlib.h>
    
    int main(int argc, const char* argv[]) {
        std::atomic<int> num;
    
        num = 12;
        if (num == 10) {
            return 0;
        }
        return 1;
    }
    

    compiled with -O3:

       0x0000000000000560 <+0>:     sub    $0x18,%rsp
       0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
       0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
       0x0000000000000572 <+18>:    xor    %eax,%eax
       0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
       0x000000000000057c <+28>:    mfence 
       0x000000000000057f <+31>:    mov    0x4(%rsp),%eax
       0x0000000000000583 <+35>:    cmp    $0xa,%eax
       0x0000000000000586 <+38>:    setne  %al
       0x0000000000000589 <+41>:    mov    0x8(%rsp),%rdx
       0x000000000000058e <+46>:    xor    %fs:0x28,%rdx
       0x0000000000000597 <+55>:    jne    0x5a1 <main+65>
       0x0000000000000599 <+57>:    movzbl %al,%eax
       0x000000000000059c <+60>:    add    $0x18,%rsp
       0x00000000000005a0 <+64>:    retq
    

    We can see that the read from the atomic variable at +31 doesn't require anything special, but because we wrote to the atomic at +20, the compiler had to insert an mfence instruction afterwards which ensures that this thread waits for its store to become visible before doing any later loads. This is expensive, stalling this core until the store buffer drains. (Out-of-order exec of later non-memory instructions is still possible on some x86 CPUs.)
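
    As an aside, a minimal sketch (mine, not part of the original answer) of the classic store/load pattern shows what that fence buys: under seq_cst at least one thread must observe the other's store, so r1 and r2 cannot both end up 0. Without the full barrier after each store, a core could satisfy its load before its own store became globally visible, and both could read 0.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread_a() {
        x.store(1);     // seq_cst store: mov + mfence, as in the listing above
        r1 = y.load();  // seq_cst load: plain mov
    }

    void thread_b() {
        y.store(1);
        r2 = x.load();
    }

    int main() {
        std::thread a(thread_a), b(thread_b);
        a.join(); b.join();
        assert(r1 == 1 || r2 == 1);  // (0, 0) is forbidden under seq_cst
        return 0;
    }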

    If we instead use a weaker ordering (such as std::memory_order_release) on the write:

    #include <atomic>
    #include <stdlib.h>
    
    int main(int argc, const char* argv[]) {
        std::atomic<int> num;
    
        num.store(12, std::memory_order_release);
        if (num == 10) {
            return 0;
        }
        return 1;
    }
    

    Then on x86 we don't need the fence:

       0x0000000000000560 <+0>:     sub    $0x18,%rsp
       0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
       0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
       0x0000000000000572 <+18>:    xor    %eax,%eax
       0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
       0x000000000000057c <+28>:    mov    0x4(%rsp),%eax
       0x0000000000000580 <+32>:    cmp    $0xa,%eax
       0x0000000000000583 <+35>:    setne  %al
       0x0000000000000586 <+38>:    mov    0x8(%rsp),%rdx
       0x000000000000058b <+43>:    xor    %fs:0x28,%rdx
       0x0000000000000594 <+52>:    jne    0x59e <main+62>
       0x0000000000000596 <+54>:    movzbl %al,%eax
       0x0000000000000599 <+57>:    add    $0x18,%rsp
       0x000000000000059d <+61>:    retq   
    

    Note though, if we compile this same code for AArch64:

       0x0000000000400530 <+0>:     stp  x29, x30, [sp,#-32]!
       0x0000000000400534 <+4>:     adrp x0, 0x411000
       0x0000000000400538 <+8>:     add  x0, x0, #0x30
       0x000000000040053c <+12>:    mov  x2, #0xc
       0x0000000000400540 <+16>:    mov  x29, sp
       0x0000000000400544 <+20>:    ldr  x1, [x0]
       0x0000000000400548 <+24>:    str  x1, [x29,#24]
       0x000000000040054c <+28>:    mov  x1, #0x0
       0x0000000000400550 <+32>:    add  x1, x29, #0x10
       0x0000000000400554 <+36>:    stlr x2, [x1]
       0x0000000000400558 <+40>:    ldar x2, [x1]
       0x000000000040055c <+44>:    ldr  x3, [x29,#24]
       0x0000000000400560 <+48>:    ldr  x1, [x0]
       0x0000000000400564 <+52>:    eor  x1, x3, x1
       0x0000000000400568 <+56>:    cbnz x1, 0x40057c <main+76>
       0x000000000040056c <+60>:    cmp  x2, #0xa
       0x0000000000400570 <+64>:    cset w0, ne
       0x0000000000400574 <+68>:    ldp  x29, x30, [sp],#32
       0x0000000000400578 <+72>:    ret
    

    When we write to the variable at +36, we use a Store-Release instruction (stlr), and loading at +40 uses a Load-Acquire (ldar). These each provide a partial memory fence (and together form a full fence).
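
    For context, here is a sketch (my example, not from the original answer) of the usual release/acquire pairing that those instructions implement: the writer publishes data with a release store, the reader spins on an acquire load, and everything written before the release is guaranteed to be visible after the acquire succeeds.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                   // plain store
        ready.store(true, std::memory_order_release);   // stlr on AArch64
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // ldar on AArch64
            ;                                           // spin until published
        assert(payload == 42);                          // ordered by release/acquire
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
        return 0;
    }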

    You should only use an atomic when you have to reason about the ordering of accesses to the variable. To answer your add-on question: read the atomic with std::memory_order_relaxed. That is the fastest option, but it gives no guarantee of synchronizing with writes from other threads; only the atomicity of the read itself is guaranteed.
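
    For completeness, a minimal sketch (mine) of that add-on answer: a relaxed load compiles to a plain mov on x86 and an ordinary ldr on AArch64, so it is the cheapest way to read the atomic when all you need is a tearing-free value, e.g. polling a stop flag whose exact timing doesn't matter.

    #include <atomic>
    #include <thread>

    std::atomic<bool> stop_requested{false};

    void worker() {
        // Tearing-free read with no ordering guarantees: we may notice the
        // flag slightly "late", which is acceptable for a stop flag.
        while (!stop_requested.load(std::memory_order_relaxed)) {
            // ... hypothetical unit of work ...
        }
    }

    int main() {
        std::thread t(worker);
        stop_requested.store(true, std::memory_order_relaxed);
        t.join();
        return 0;
    }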