Search code examples
cudashuffleptxgpu-warp

Warp shuffling for CUDA


I need to make a warp shuffling that look like this: warp shuffling

On this picture, the number of threads is limited to 8 to make it readable. If I read the Nvidia SDK and ptx manual, the shuffle instruction should do the job, specially the shfl.idx.b32 d[|p], a, b, c; ptx instruction.

From the manual I read:

Each thread in the currently executing warp will compute a source lane
index j based on input operands b and c and the mode. If the computed
source lane index j is in range, the thread will copy the input operand
a from lane j into its own destination register d;

So, providing proper values of b and c, I should be able to do it by writing a function like this (inspired from CUDA SDK __shufl primitive implementation).

  __forceinline__ __device __ float shuffle(float var){
   float ret;
   int srcLane = ???
   int c = ???
   asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
  return ret;

}

If it is possible, what is the constant for srcLane and c? I am not able to determine them (I am using CUDA 8.0) .

Best,

Timocafe


Solution

  • I would recommend doing this with the CUDA intrinsic rather than with PTX (or inline ASM). However the following code demonstrates both methods:

    // cat t54.cu
    #include <stdio.h>
    
    __global__ void k(){
        int i = threadIdx.x;
        int j = i;
        if (i<4) j*=2;
        if ((i>3) && (i<8)) j-=(7-i);
        int k = __shfl_sync(0x0FFU, i+100, j);
        printf("lane: %d, result: %d\n", i, k);
    }
    
    __forceinline__ __device__ float shuffle(float var, int lane){
        float ret;
        int srcLane = lane;
        int c = 0x1F;
        asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
        return ret;
    }
    
    __global__ void k1(){
        int i = threadIdx.x;
        int j = i;
        if (i<4) j*=2;
        if ((i>3) && (i<8)) j-=(7-i);
        float k = shuffle((float)(i+100), j);
        printf("lane: %d, result: %f\n", i, k);
    }
    
    
    int main(){
        k<<<1,8>>>();
        cudaDeviceSynchronize();
        k1<<<1,8>>>();
        cudaDeviceSynchronize();
    }
    
    $ nvcc -arch=sm_35 -o t54 t54.cu
    $ cuda-memcheck ./t54
    ========= CUDA-MEMCHECK
    lane: 0, result: 100
    lane: 1, result: 102
    lane: 2, result: 104
    lane: 3, result: 106
    lane: 4, result: 101
    lane: 5, result: 103
    lane: 6, result: 105
    lane: 7, result: 107
    lane: 0, result: 100.000000
    lane: 1, result: 102.000000
    lane: 2, result: 104.000000
    lane: 3, result: 106.000000
    lane: 4, result: 101.000000
    lane: 5, result: 103.000000
    lane: 6, result: 105.000000
    lane: 7, result: 107.000000
    ========= ERROR SUMMARY: 0 errors
    $
    

    Using the CUDA intrinsic (the first method) the only real task is to compute the source lane index. Based on your pattern I wrote some code to do that and put it in the variable j.