The following is a simple example of two threads doing addition operations.
#include <iostream>
#include <atomic>
#include <thread>
std::atomic<int> atomic_int(0);
int count = 1000000;
int func1(){
    for (int i = 0; i < count; ++i){
        int val = atomic_int.load(std::memory_order_acquire);
        atomic_int.store(val + 1, std::memory_order_release);
    }
    return 0;
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func1);
    t2.join();
    t1.join();
    printf("%d\n", atomic_int.load());
    std::cout << "Value stored in atomic object: " << atomic_int << std::endl;
    return 0;
}
This code does not do what I hoped.
My original intention was to have both threads add one to atomic_int. After each has run 1 million iterations, the total value should be 2 million.
Because we know the CPU will cache a modified value, I changed the variable to an atomic and used load and store together. My original understanding was that load reads from main memory and store writes to main memory. In that case, even after the first thread writes the variable, the second thread, which would otherwise have read a stale cached value, now reads from main memory, so the final value should be correct.
Later I thought of a situation: is it possible that both threads read the variable's value at the same time and then each modify it one after the other? In that case the store that arrives later effectively overwrites the other thread's increment, so the total doesn't go up by one for each iteration.
Question
I tried inserting a memory-barrier function, but it had no effect:
#include <iostream>
#include <atomic>
#include <thread>
std::atomic<int> atomic_int(0);
int count = 1000000;
int func1(){
    for (int i = 0; i < count; ++i){
        int val = atomic_int.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst); // insert a memory barrier
        atomic_int.store(val + 1, std::memory_order_release);
    }
    return 0;
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func1);
    t2.join();
    t1.join();
    printf("%d\n", atomic_int.load());
    std::cout << "Value stored in atomic object: " << atomic_int << std::endl;
    return 0;
}
Is it possible that both threads read the variable's value at the same time and then each modify it one after the other? ... so the total doesn't go up by one for each iteration.
Yes, this is precisely the problem with separately-atomic load/add/store, and why no amount of ordering of each thread's operations separately can make that equivalent to an atomic RMW. Or worse, multiple increments by the other thread could get stepped on if this thread stalls for a long time between load and store. (e.g. OS puts it to sleep to run something else.)
With pathological scheduling on a single-core machine, the minimum value for atomic_int after both threads finish is something like 2, even with everything happening sequentially-consistently (where program behaviour is explainable as some interleaving of operations from all threads). No memory-reordering is necessary to explain that. (For example: thread 1 loads 0 and stalls; thread 2 does 999,999 full increments; thread 1 stores 1, wiping them out; thread 2 loads 1 and stalls; thread 1 does its remaining 999,999 increments; thread 2 stores 2.)
You need atomic_fetch_add. Barriers can't create atomicity, neither an RMW like in this case, nor a wider atomic load transaction out of multiple pure-loads or pure-stores.
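For reference, here is a minimal sketch of what the corrected loop could look like (memory_order_relaxed is enough here, since all we need is atomicity of the increment itself, not ordering with respect to other objects):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> atomic_int(0);
const int count = 1000000;

void func1(){
    for (int i = 0; i < count; ++i){
        atomic_int.fetch_add(1, std::memory_order_relaxed);  // one indivisible RMW per iteration
    }
}

int main()
{
    std::thread t1(func1);
    std::thread t2(func1);
    t1.join();
    t2.join();
    printf("%d\n", atomic_int.load());  // reliably prints 2000000
    return 0;
}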
In terms of a system with coherent shared cache like normal modern CPUs, atomic_thread_fence() just limits or prevents reordering of this core's accesses to that coherent cache. It doesn't directly interact with fences in other threads, only by letting loads and stores create synchronizes-with relationships. It effectively strengthens weaker operations, e.g. load(relaxed); fence(acquire); is at least as strong as a load(acquire). But that's all you get.
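To illustrate that strengthening effect with a sketch of my own (a hypothetical ready flag publishing a non-atomic payload, not code from the question): a relaxed load followed by an acquire fence gives the same synchronizes-with guarantee as an acquire load, but it still doesn't turn a separate load and store into one atomic operation.

#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;                       // non-atomic data published by the producer
std::atomic<bool> ready(false);

void producer(){
    payload = 42;
    ready.store(true, std::memory_order_release);            // publish
}

void consumer(){
    while (!ready.load(std::memory_order_relaxed)){}          // relaxed spin
    std::atomic_thread_fence(std::memory_order_acquire);      // upgrades the relaxed load to acquire
    printf("%d\n", payload);                                  // safe: synchronizes-with the release store
}

int main(){
    std::thread t1(consumer), t2(producer);
    t1.join(); t2.join();
}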
At most, memory barriers can help recover sequential consistency, making your program behaviour explainable in terms of some interleaving of source order. But they don't go beyond that, in the abstract machine or in practice on real hardware. And interleaving of source operations makes it possible to step on increments from the other thread.
You can't roll your own lock-free1 atomic RMWs out of pure loads and pure stores because nothing can stop a different thread from storing in between them. In C++ terms, that store appears in the modification order of your atomic_int between the value you loaded and the value you later store.
See Is incrementing an int effectively atomic in specific cases? for the details of how fetch_add works in hardware on x86 (and I assume ARMv8.1), and notice that it does things which a separate load + barrier + store can't do. Specifically, stopping this core from replying to MESI share requests between the load and store parts of lock xadd, so this core keeps the cache line in MESI exclusive state for the duration, guaranteeing that no other threads can write it during the operation. And any reads of this value were obtained before we got exclusive ownership of the cache line.
(ISAs that compile fetch_add to an LL/SC retry loop will instead have the store-conditional fail if this core lost exclusive ownership of the cache line since the load-linked. This is another way to only let the store actually happen if this core had exclusive ownership of the cache line for the whole atomic RMW.)
Even relaxed atomic loads and stores on the same object can't reorder with each other from the same thread (the C++ standard calls these the coherency rules), so you're gaining exactly nothing by putting a barrier there, both in ISO C++ and on real hardware. (Except maybe for ordering these operations wrt. ops by this thread on other objects.)
A compiler could legally compile your atomic load/add/store to x86-64 add dword ptr [rip + atomic_int], 1, the exact same asm it uses for plain non-atomic int increments, which are in fact non-atomic on real hardware with multiple hardware cores. (Or to separate load and store instructions, which wouldn't even be atomic on a uniprocessor machine: a context-switch could happen between load and store.)
now reads from main memory, so the final value should be correct.
That's not the actual mechanism for what's going on. std::atomic doesn't bypass CPU cache, it just stops the compiler from keeping values in registers (which some people confusingly call "caching" them).
All C++ implementations only run std::thread across cores that are cache-coherent with each other. So the only thing needed for a store to be visible to loads in other cores promptly is for it to actually happen in asm, not getting optimized away.
Footnote 1: You can roll your own lock with just seq_cst loads + stores, e.g. https://en.wikipedia.org/wiki/Peterson%27s_algorithm or a different algorithm that works for more threads. If all threads accessing this object choose to respect the lock, i.e. do mutual exclusion around accesses to the variable, then increments will be thread safe.
(With atomic_int you could only lock around writes and RMWs, and still let reads just read without taking the lock, as long as your critical sections never ever store a temporary value, only a final value that's ok for other threads to read. But this is still basically readers/writers locking, not lock-free by definition, and not compatible with other threads that use fetch_add without taking the lock.)
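To make the footnote concrete, here is a sketch of Peterson's algorithm for two threads (my own example, not from the question; every access uses seq_cst, and each thread passes its own index, 0 or 1):

std::atomic<bool> interested[2] = {{false}, {false}};
std::atomic<int> turn(0);

void lock(int me){
    int other = 1 - me;
    interested[me].store(true, std::memory_order_seq_cst);
    turn.store(other, std::memory_order_seq_cst);
    // spin while the other thread is also interested and it's the other thread's turn
    while (interested[other].load(std::memory_order_seq_cst)
           && turn.load(std::memory_order_seq_cst) == other){}
}

void unlock(int me){
    interested[me].store(false, std::memory_order_seq_cst);
}

// Each thread then does: lock(my_id); ++non_atomic_counter; unlock(my_id);

The seq_cst ordering matters: the algorithm depends on each thread's store to turn becoming visible before its loads in the spin loop, a store/load ordering that weaker memory orders don't guarantee.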
Or you can roll your own fetch_add with a compare_exchange_weak retry loop, but don't; that's slower than using a single hardware operation if available. A CAS retry loop is still lock-free but not wait-free.
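A sketch of what such a retry loop looks like (my own example with a hypothetical my_fetch_add wrapper, only to show the shape; prefer the real fetch_add):

int my_fetch_add(std::atomic<int>& a, int add){
    int expected = a.load(std::memory_order_relaxed);
    // compare_exchange_weak updates 'expected' with the current value when it fails
    while (!a.compare_exchange_weak(expected, expected + add,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)){
        // another thread wrote the object between our load and the CAS; try again
    }
    return expected;   // like fetch_add, returns the value from before the addition
}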
Is there a way to observe in detail how the CPU reads and writes values, such as changes in the CPU cache and the invalidate queue, so as to better observe the effects of the various barrier functions?
No, except in a simulator.
I guess on an LL/SC machine, like ARM with ldrex / strex, you could count retries to count how many times this core actually lost MESI exclusive ownership of a cache line and had to start over on its atomic RMW.
On Intel CPUs, there's a machine_clears.memory_ordering perf event you can count with perf stat, which is related to other threads writing our cache lines after a speculative early load, but before a load was architecturally allowed. Why flush the pipeline for Memory Order Violation caused by other logical processors?
And you can run performance experiments like What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? to see how much slowdown you get from different logical cores vs. different physical cores writing the same line, and look at various HW performance counters while you do.
Semi-related: