cuda

Why we don't need to use volatile variables when using __syncthreads


Everything is in the question. I understand why we need variables to be volatile when we are using __threadfence_block and similar functions:

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this is ensured by using the volatile keyword as detailed in Volatile Qualifier.

However, I wonder why we do not need variables to be volatile when we are using the __syncthreads function.


Solution

  • According to the programming guide, __syncthreads() is both an execution barrier and a memory fence:

    waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

    The memory-fencing function (i.e. the "visibility" aspect) "forces" all updates to shared and global memory to be visible to the other threads in the block.

    I assume this is what you are asking about. I don't think making blanket statements like "you don't need to use volatile when using __syncthreads()" is a sensible idea. It will depend on the code. But in some situations, for example the classical parallel reduction, using a __syncthreads() at every step in the block-wide reduction will mean that the shared memory used for such a reduction need not be marked with volatile.
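
    For instance, here is a minimal sketch of that classical reduction (the kernel and variable names are mine, and it assumes a power-of-two block size). Because a __syncthreads() separates every step, sdata needs no volatile qualifier:

    // Block-wide sum reduction (sketch). Assumes blockDim.x is a power of two
    // and the launch supplies blockDim.x * sizeof(int) bytes of shared memory.
    __global__ void block_reduce(const int *in, int *out){
      extern __shared__ int sdata[];              // note: NOT volatile
      unsigned t = threadIdx.x;
      sdata[t] = in[blockIdx.x * blockDim.x + t];
      __syncthreads();                            // all loads visible block-wide

      for (unsigned s = blockDim.x / 2; s > 0; s >>= 1){
        if (t < s)
          sdata[t] += sdata[t + s];
        __syncthreads();   // fence + barrier at every step: no thread reads
      }                    // a partner's slot before the write to it is visible

      if (t == 0)
        out[blockIdx.x] = sdata[0];
    }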

    Since __syncthreads() is both an execution barrier and a memory fence, there are certain statements we can make about __syncthreads() usage that would not be applicable to just usage of __threadfence() alone.

    Suppose I have this code:

    __global__ void k(int *data){
      ...
      *data = 1;        // every thread in the block writes the same value
      __syncthreads();  // block-wide execution barrier + memory fence
      if (*data == 1){  // guaranteed to be true for every thread in the block
        ...
      }
      ...
    }
    

    In this case, any thread in a particular block executing the if statement is guaranteed to see *data as 1. There are two components to this:

    1. __syncthreads() is a (threadblock-wide) memory fence. It forces any thread (in the block) that has written the value, to make that value visible. This effectively means, since this is a block-wide memory fence, that the value written at least has populated the L1 cache.

    2. __syncthreads() is a (threadblock-wide) execution barrier. It forces all threads (in the block) to reach the barrier before any can proceed. This execution ordering behavior means that by the time any thread executes the above if-statement, the guarantee in item 1 above is in effect.
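
    To see why both components are needed, consider a hypothetical variant (mine, not from the original code) that keeps the fence but drops the barrier, and has each thread read a neighbor's slot:

    __global__ void k_fence_only(int *data){
      data[threadIdx.x] = 1;
      __threadfence_block();   // item 1 only: the write, once made, is visible
      unsigned n = (threadIdx.x + 1) % blockDim.x;
      if (data[n] == 1){
        // not guaranteed: with no execution barrier, this thread may reach
        // the test above before its neighbor has written anything at all
      }
    }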

    Note a subtle distinction: threads in other blocks, at other points in the code, may or may not see the value written by this block.

    Only when we have combined execution synchronization and memory fencing can we be certain that values populated by one thread are truly visible to another thread. And without the use of cooperative groups, CUDA provides no mechanism to synchronize execution across separate blocks.
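
    As a sketch of that cooperative-groups mechanism (the kernel name is mine; the kernel must be started with cudaLaunchCooperativeKernel on a device that supports cooperative launch):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void k_grid(int *data){
      cg::grid_group grid = cg::this_grid();
      // ... phase 1: each block writes its portion of data ...
      grid.sync();   // grid-wide execution barrier + memory fence
      // ... phase 2: threads may now read what other blocks wrote ...
    }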

    __threadfence(), by itself, makes the value eventually visible, but without understanding the relative order of execution between the writing thread and the reading thread, it's not possible to make guarantees based on code inspection alone.

    Likewise volatile guarantees something similar to __threadfence() (for the writing thread), but is also somewhat different. __threadfence() guarantees that the writing thread will eventually push its data to the L2 (i.e. make it visible). volatile does something similar, but also guarantees that a reading thread will not read a "stale copy" in L1 but will go to L2 (at least) to fetch the current value, any time a read of that value occurs in the code.

    Note that there is never "invalidation" of L1 cache data triggered by device code activity on another SM. volatile effectively guarantees that a load will bypass the L1. volatile also guarantees that a store will go directly to the L2. __threadfence() does something similar to the latter (at least by the point that the thread has proceeded beyond the __threadfence()), but makes no guarantees about L1 state in other SMs, or how threads in other SMs will read the value.
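
    Putting those last two paragraphs together, here is a sketch of the familiar flag-based handoff between blocks (the names are mine): __threadfence() orders the writer's stores, and volatile keeps the reader's loads from being satisfied by a stale L1 copy.

    __device__ volatile int result;    // volatile: loads bypass L1, stores hit L2
    __device__ volatile int ready = 0;

    __device__ void produce(int value){ // called by the writing thread
      result = value;                   // store goes directly to the L2
      __threadfence();                  // order the data store before the flag
      ready = 1;
    }

    __device__ int consume(){           // called by a thread on another SM
      while (ready == 0)                // each poll fetches the current value
        ;                               // from (at least) the L2, never stale L1
      return result;
    }

    On recent toolkits, libcu++'s cuda::atomic expresses this kind of handoff more directly, but the sketch matches the volatile/__threadfence semantics described above.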