cuda opencl gpu memory-barriers barrier

Equivalent of barrier(CLK_GLOBAL_MEM_FENCE) in CUDA


What is the equivalent of calling barrier(CLK_GLOBAL_MEM_FENCE) (OpenCL), in CUDA?

It should wait until all the threads in the block have reached the barrier, and global memory accesses made before the barrier should be visible to all the threads in the block after the barrier.

Is it sufficient to call __syncthreads(), or is __threadfence() necessary for the global memory fence, or both? If both are needed, in what order should they be called?


Solution

  • You may wish to read the CUDA documentation on both memory fences and execution barriers.

    __syncthreads() is both an execution barrier (for the threads in the block) and a memory fence for shared and global memory operations. For global memory operations, the fence is enforced with respect to threads in the block only.

    Fencing here has a specific meaning: memory operations issued before the fence are observed by the other threads in the block as having occurred (i.e. they are visible) before any operations issued after the fence. Please review the CUDA documentation for detailed descriptions of this.
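
    As an illustration of the block-scope case asked about, here is a minimal sketch in which every thread writes to global memory, calls __syncthreads(), and then reads a value written by another thread of the same block. The kernel name rotate_within_block, the host helper run_rotate_demo, and the launch sizes are hypothetical choices made for this example only.

    ```cuda
    #include <cuda_runtime.h>
    #include <vector>

    // Each thread writes one element to global memory, then reads the element
    // written by a different thread of the same block.
    __global__ void rotate_within_block(const int *in, int *scratch, int *out)
    {
        int base = blockIdx.x * blockDim.x;
        int tid  = threadIdx.x;

        scratch[base + tid] = in[base + tid] * 2;   // global write before the barrier

        // Execution barrier for the block. Global (and shared) memory writes made
        // before it are visible to all threads of this block after it.
        __syncthreads();

        int neighbor = (tid + 1) % blockDim.x;      // stays inside this block
        out[base + tid] = scratch[base + neighbor];
    }

    // Hypothetical host helper: returns true if every output element matches.
    bool run_rotate_demo()
    {
        const int blocks = 4, threads = 128, n = blocks * threads;
        std::vector<int> h_in(n), h_out(n, 0);
        for (int i = 0; i < n; ++i) h_in[i] = i;

        int *d_in = nullptr, *d_scratch = nullptr, *d_out = nullptr;
        cudaMalloc((void **)&d_in, n * sizeof(int));
        cudaMalloc((void **)&d_scratch, n * sizeof(int));
        cudaMalloc((void **)&d_out, n * sizeof(int));
        cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

        rotate_within_block<<<blocks, threads>>>(d_in, d_scratch, d_out);

        // This blocking copy also reports errors from the earlier calls.
        cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                     cudaMemcpyDeviceToHost);
        cudaFree(d_in); cudaFree(d_scratch); cudaFree(d_out);
        if (err != cudaSuccess) return false;

        for (int b = 0; b < blocks; ++b)
            for (int t = 0; t < threads; ++t) {
                int expected = 2 * h_in[b * threads + (t + 1) % threads];
                if (h_out[b * threads + t] != expected) return false;
            }
        return true;
    }
    ```

    Because the neighbor index never leaves the block, no __threadfence() is needed in this sketch.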

    If you also require (global memory) fencing outside the block (i.e. for all threads in the grid) then you must also use __threadfence(). Unlike __syncthreads(), __threadfence() is not an execution barrier of any kind. It is a memory fence only.

    The order of the two calls should not matter (as long as there are no intervening operations). A fence is a dividing point, and the fencing done by __threadfence() is a superset of the fencing done by __syncthreads().
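
    For the grid-wide case, the following is a rough sketch, loosely modeled on the reduction example in the CUDA programming guide, of using __threadfence() when data written by one block is read by another. All names (blocks_done, sum_with_threadfence, run_threadfence_sum_demo) and the sizes are assumptions made for illustration. Note that __threadfence() only orders the memory operations; the atomic counter is what tells the last block that the others have finished.

    ```cuda
    #include <cuda_runtime.h>
    #include <vector>

    __device__ unsigned int blocks_done = 0;   // hypothetical arrival counter

    __global__ void sum_with_threadfence(const int *in, volatile int *partial,
                                         int *total, int n_per_block)
    {
        __shared__ bool is_last_block;

        if (threadIdx.x == 0) {
            int sum = 0;
            for (int i = 0; i < n_per_block; ++i)
                sum += in[blockIdx.x * n_per_block + i];
            partial[blockIdx.x] = sum;          // global write read by another block

            // Memory fence only: the write above is ordered before the atomic
            // below for every thread in the grid. No thread waits here.
            __threadfence();

            unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
            is_last_block = (ticket == gridDim.x - 1);
        }

        // Block-level barrier so all threads of the block see is_last_block.
        __syncthreads();

        if (is_last_block && threadIdx.x == 0) {
            int sum = 0;
            for (unsigned int b = 0; b < gridDim.x; ++b)
                sum += partial[b];              // written by other blocks
            *total = sum;
            blocks_done = 0;                    // reset for a later launch
        }
    }

    // Hypothetical host helper: returns the computed total, or -1 on a CUDA error.
    int run_threadfence_sum_demo()
    {
        const int blocks = 64, threads = 32, n_per_block = 256;
        const int n = blocks * n_per_block;
        std::vector<int> h_in(n, 1);            // every element is 1
        int h_total = -1;

        int *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
        cudaMalloc((void **)&d_in, n * sizeof(int));
        cudaMalloc((void **)&d_partial, blocks * sizeof(int));
        cudaMalloc((void **)&d_total, sizeof(int));
        cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

        sum_with_threadfence<<<blocks, threads>>>(d_in, d_partial, d_total, n_per_block);

        cudaError_t err = cudaMemcpy(&h_total, d_total, sizeof(int),
                                     cudaMemcpyDeviceToHost);
        cudaFree(d_in); cudaFree(d_partial); cudaFree(d_total);
        return (err == cudaSuccess) ? h_total : -1;
    }
    ```

    The partial results are accessed through a volatile pointer so that the last block reads them from memory rather than from a stale cached copy.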