Search code examples
cudaptxptxas

What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?


From my understanding, CUDA 10.1 removed the shfl instructions:

PTX ISA version 6.4 removes the following features:

Support for shfl and vote instructions without the .sync qualifier has been removed for .targetsm_70 and higher. This support was deprecated since PTX ISA version 6.0 as documented in PTX ISA version 6.2.

What is the correct way to support shfl future and past CUDA versions?

My current methods (shared below) result in the error using CUDA 10.1:

ptxas ... line 466727; error   : Instruction 'shfl' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDACC_VER_MAJOR__ >= 9)
  var = __shfl_up_sync(mask, var, delta, width);
#else
  var = __shfl_up(var, delta, width);
#endif
  return var;
}

Also, I would like to add that one of the dependencies of my project is CUB and I believe they utilize the same method to split up _sync() and older shfl instructions. I am not sure what I am doing wrong.


Solution

  • I was doing the right thing, turns out another dependency didn't have support for sync, created a pull request for it: https://github.com/moderngpu/moderngpu/pull/32

    template <typename T>
    __device__ static __forceinline__
    T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
    {
    #if ( __CUDA_ARCH__ >= 300)
    #if (__CUDACC_VER_MAJOR__ >= 9)
      var = __shfl_up_sync(mask, var, delta, width);
    #else
      var = __shfl_up(var, delta, width);
    #endif
    #endif
      return var;
    }