c++ · cuda · memory-model

CUDA memory model: why acquire fence is not needed to prevent load-load reordering?


I am reading the book "Programming Massively Parallel Processors" and noticed the code snippet below, which implements a "domino-style" scan:

if (threadIdx.x == 0) {
    while (atomicAdd(&flags[bid], 0) == 0) {}
    // ???? why do I not need a thread fence here (for acquire semantics) to prevent load-load reordering between the loads of flags[bid] and scan_value[bid]?
    previous_sum = scan_value[bid];
    scan_value[bid+1] = previous_sum + local_sum;
    __threadfence(); // why is the "release" fence here sufficient?
    atomicAdd(&flags[bid + 1], 1);
}

According to the book, the __threadfence() is required between the two writes to ensure scan_value[bid+1] is written to global memory before flags[bid+1] is incremented, which seems to resemble release semantics in the C++ memory model (except that __threadfence has seq_cst semantics).

However, I am unable to understand why a similar __threadfence (or an acquire fence) is not needed between the two reads to prevent load-load reordering (i.e. scan_value[bid] being read before flags[bid]).


Solution

  • TL;DR
    As long as you can be sure that all loads/stores happen in source-code order and are loaded from / stored to the L2 cache, you don't need barriers.
    This works until it doesn't (new compiler, new hardware), and then you're chasing down elusive bugs.
    Acquire/release annotations guarantee correct behavior; the above code does not, it works purely by accident.

    GPU vs CPU
    Unlike a CPU core, which can occupy 4 mm², a CUDA core takes up less than 0.1 mm² (on the latest 4NP node), a much smaller die area.
    Because of this, the GPU cores do not have out-of-order execution, cache prediction, data prefetching, etc.
    Whatever order the compiler puts the instructions in is the order in which they get executed.

    Because of the while loop, the compiler cannot know how many times it must execute the load in the loop condition, but it must execute it at least once to get the result of the atomicAdd.
    Because execution cannot continue until the result of the atomicAdd has been read, the load of scan_value[bid] cannot happen before the load of flags[bid].
    In effect the loop creates a dependency, where previous_sum depends on flags[bid]. If the load were not in a loop, the two would be independent and then you would need an acquire.
    However, even though the load of scan_value must execute after the load of flags because of the loop, nothing prevents that later load from being served by the L1 cache, which may be out of date. An acquire on the flags load fixes this: any later loads must then come from the global L2.
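
    To make this concrete, here is a sketch of the same snippet with an explicit fence added between the two loads (this variant is not from the book; a plain __threadfence() is stronger than a pure acquire fence, but it is the closest built-in intrinsic if you do not want to use libcu++, which is shown further down):

    if (threadIdx.x == 0) {
        while (atomicAdd(&flags[bid], 0) == 0) {}
        __threadfence(); // plays the "acquire" role: the load of scan_value below cannot
                         // be served by data that is stale relative to the flag load
        previous_sum = scan_value[bid];
        scan_value[bid+1] = previous_sum + local_sum;
        __threadfence(); // plays the "release" role: publish scan_value before the flag
        atomicAdd(&flags[bid + 1], 1);
    }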

    Note that the write scan_value[bid+1] = previous_sum + local_sum; is not atomic, so the __threadfence is needed to ensure that other blocks will see that write before the final atomicAdd happens.
    Without that __threadfence, scan_value could linger in the block's L1 cache while the final atomicAdd goes to the globally visible L2, leaving scan_value lagging behind.
    We can see this in the final SASS assembly: the load from scan_value is an LDG (i.e. it bypasses the L1 and loads directly from the L2).

    C++ memory model
    However, this code does not comply with the C++ memory model: the load in the loop should have acquire semantics, so that correctness does not depend on some accidental detail of the nvcc compiler or the current CUDA hardware.
    So I would put an acquire on the load in the while loop.
    (Alternatively, you can mark the flags and scan_value pointers as volatile; see below for why this is a bad idea.)

    Using libcu++ you can rewrite this snippet more efficiently as follows (see godbolt for a side-by-side comparison: https://cuda.godbolt.org/z/PqMTqjKEj):

    #include <cuda/atomic>
    
    //add boilerplate to make it compile in compiler explorer
    __global__ void example(int* flags, int* scan_value) {
        const auto bid = blockIdx.x;
        auto local_sum = clock64(); //something random
        //your code again
        if (threadIdx.x == 0) {
            //make sure to specify thread_scope_device, or you'll get the slow default thread_scope_system, which includes the CPU.
            auto f0 = cuda::atomic_ref<int, cuda::thread_scope_device>(flags[bid]);
            //no more read-modify-write; a plain load with acquire semantics is enough.
            while(f0.load(cuda::memory_order_acquire) == 0) {
                __nanosleep(100); //back off a little; only useful if other warps can run on the same SM.
            }
            //if this were not in a loop, the next load could be reordered before the flag load.
            //even with the loop, nothing would stop the next load from being served by a stale L1 line,
            //causing a temporal mismatch with flags. The load(acquire) above prevents both issues.
    
            auto previous_sum = scan_value[bid];
            scan_value[bid+1] = previous_sum + local_sum;
            auto f1 = cuda::atomic_ref<int, cuda::thread_scope_device>(flags[bid + 1]);
            //release, not seq_cst
            f1.fetch_add(1, cuda::memory_order_release);
        }
    }
    

    Note that in your while loop the loads that merely spin can be relaxed; only the load that finally observes the non-zero flag needs acquire semantics. Such an optimization is only sensible if a lot of time is spent in the while loop, but in that case you should also add backoff, like so:

    while (f0.load(cuda::memory_order_acquire) == 0) {
        //spin on cheap relaxed loads, then re-check with acquire once the flag looks set
        while (f0.load(cuda::memory_order_relaxed) == 0) {
            //allow other warps in the same block to run
            __nanosleep(100);
        }
    }
    

    The difference between relaxed and acquire in the generated SASS (the acquire version adds a CCTL.IVALL, which invalidates the L1 cache so that subsequent loads are served from the L2):

    relaxed: while(f0.load(cuda::memory_order_relaxed) == 0) {}

    .L_x_4:
     LD.E.STRONG.GPU R4, [R2] 
     ISETP.NE.AND P0, PT, R4, RZ, PT 
     @!P0 BRA `(.L_x_4) 
    

    acquire: while(f0.load(cuda::memory_order_acquire) == 0) {}

    .L_x_6:
     LD.E.STRONG.GPU R4, [R2] 
     ISETP.NE.AND P0, PT, R4, RZ, PT 
     CCTL.IVALL 
     @!P0 BRA `(.L_x_6) 
    

    What about volatile?
    In CUDA, volatile means something different from standard C++ volatile. For example, global pointers marked with it always go through the L2 cache. My advice is: don't use it. Just stick with the atomics with C++ semantics provided in the <cuda/atomic> header.
    Yes, you can write code that is fractionally faster that way, but the added complexity is just not worth it.
    It also prevents standard parallel C++ code from being ported between the CPU and the GPU: when you use volatile, there is no way to validate your GPU implementation against an existing, known-good CPU implementation.
    If you stick to libcu++ atomics, the GPU source can be near identical to the CPU code.
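
    For illustration only (not a recommendation), this is roughly what the volatile variant mentioned above could look like; the volatile qualifier forces every access through the L2 and keeps the compiler from reordering those accesses against each other, but it is not part of the portable C++ memory model:

    if (threadIdx.x == 0) {
        volatile int* vflags = flags;       //CUDA volatile: every access goes to the L2,
        volatile int* vscan  = scan_value;  //never served from a stale L1 line
        while (vflags[bid] == 0) {}
        int previous_sum = vscan[bid];
        vscan[bid + 1] = previous_sum + local_sum;
        __threadfence();                    //the release-side fence from the book's code is still needed
        atomicAdd(&flags[bid + 1], 1);
    }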

    Warning! Due to block scheduling this code can deadlock
    If you have more blocks in the grid than the GPU can keep resident at once (which is common), your code can deadlock, for example if the blocks are scheduled in descending order.
    The book shows a fix for this deadlock on the next page.
    There are other issues with this code: it forces blocks to wait on one another, which is inefficient because only one block at a time can make progress through the domino step, cutting efficiency by a factor of 50 on a GPU with 50 SMs.
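
    The book's fix is not reproduced here; one common technique (a sketch of the general approach with hypothetical names, not a quote from the book) is dynamic block index assignment: each block fetches its logical index from a global counter, so a block can only ever wait on a predecessor that has already started executing, which removes the scheduling deadlock:

    __device__ unsigned int block_counter; //must be zero before the launch (e.g. set it from the host with cudaMemcpyToSymbol)

    __global__ void domino_scan(int* flags, int* scan_value /*, ... */) {
        __shared__ unsigned int bid_s;
        if (threadIdx.x == 0) {
            //blocks take indices in the order in which they actually start executing
            bid_s = atomicAdd(&block_counter, 1);
        }
        __syncthreads();
        const unsigned int bid = bid_s; //use this instead of blockIdx.x everywhere below
        //... local scan, then the wait/publish pattern shown above ...
    }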

    Don't use this code; do a cascading reduction instead, as sketched below.
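
    For reference, a minimal sketch of one cascading approach (per-tile scans, then a scan over the tile totals), under the assumption that the number of tiles fits into a single block for the middle pass; the kernel names and the naive Hillis-Steele block scan are illustrative choices, not the book's code:

    //Pass 1: each block scans its own tile and records the tile total.
    __global__ void scan_tiles(const int* in, int* out, int* tile_sums, int n) {
        extern __shared__ int tmp[]; //blockDim.x ints of dynamic shared memory
        const int tid = threadIdx.x;
        const int gid = blockIdx.x * blockDim.x + tid;
        tmp[tid] = (gid < n) ? in[gid] : 0;
        __syncthreads();
        //naive Hillis-Steele inclusive scan of the tile
        for (int stride = 1; stride < blockDim.x; stride *= 2) {
            const int val = (tid >= stride) ? tmp[tid - stride] : 0;
            __syncthreads();
            tmp[tid] += val;
            __syncthreads();
        }
        if (gid < n) out[gid] = tmp[tid];
        if (tid == blockDim.x - 1) tile_sums[blockIdx.x] = tmp[tid];
    }

    //Pass 2: exclusive scan of the tile totals (assumes num_tiles <= blockDim.x).
    __global__ void scan_tile_sums(int* tile_sums, int num_tiles) {
        extern __shared__ int tmp[];
        const int tid = threadIdx.x;
        tmp[tid] = (tid < num_tiles) ? tile_sums[tid] : 0;
        __syncthreads();
        for (int stride = 1; stride < blockDim.x; stride *= 2) {
            const int val = (tid >= stride) ? tmp[tid - stride] : 0;
            __syncthreads();
            tmp[tid] += val;
            __syncthreads();
        }
        if (tid < num_tiles) tile_sums[tid] = (tid == 0) ? 0 : tmp[tid - 1];
    }

    //Pass 3: add each tile's offset to every element of that tile.
    __global__ void add_tile_offsets(int* out, const int* tile_sums, int n) {
        const int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) out[gid] += tile_sums[blockIdx.x];
    }

    The three passes are launched one after the other (pass 2 with a single block and blockDim.x * sizeof(int) bytes of dynamic shared memory), so no block ever has to wait on another block within a kernel.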