Tags: cuda, synchronization, gpu-shared-memory

CUDA: should a unique block index and its calculation be moved to shared memory?


Let's say we have a kernel that needs a unique block index and that should scale as much as possible, so it has to work across a 3D grid.

The calculation looks fairly complex, so what about letting just one thread perform it and storing the result in shared memory? Is that a good idea?
In the literature the index is always kept in a register, but what is the drawback of putting it in shared memory?

I am not sure, but isn't a shared memory read/write around 4 cycles while a register access is 1 cycle?

Instead of:

__global__ void kernel()
{
    //get unique 3D block index, computed per thread and kept in a register
    const unsigned long long int blockId = blockIdx.x //1D
        + blockIdx.y * gridDim.x //2D
        + gridDim.x * gridDim.y * blockIdx.z; //3D
}

Maybe use this instead (assuming only the x-dimension of the block is used):

__global__ void kernel()
{
    __shared__ unsigned long long int blockId_s;

    //let a single thread compute the index and publish it to the whole block
    if(threadIdx.x == 0)
        blockId_s = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D
    __syncthreads();
}
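For context, here is a sketch of how the rest of the block might consume blockId_s after the barrier (the output array and its use are illustrative assumptions, not part of the original code); it also shows why __syncthreads() is needed before any other thread reads the value:

__global__ void kernel(unsigned int *out)
{
    __shared__ unsigned long long int blockId_s;

    if(threadIdx.x == 0)
        blockId_s = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D
    __syncthreads(); //without this, threads of other warps could read blockId_s too early

    //every thread derives a globally unique thread index from the shared value
    const unsigned long long int threadId = blockId_s * blockDim.x + threadIdx.x;
    out[threadId] = threadIdx.x; //illustrative use of the index
}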

This would save one register per thread, and registers are scarce on compute capability 1.x devices.

I have not run any tests and have no idea whether this is good or bad for performance. The one extra available register on cc 1.x is an argument for it, but performance should be slightly slower because of the __syncthreads() statement.
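As an aside (assuming an nvcc toolchain; the file name and architecture flag are placeholders), the register claim can be checked directly by letting ptxas print per-kernel resource usage:

nvcc -arch=sm_13 -Xptxas -v kernel.cu

ptxas then reports the registers and shared memory used by each kernel, so both variants can be compared without guessing.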


Solution

  • In this case the answer was yes, but mostly because the resulting code used fewer registers, which allowed for higher overall occupancy and some speedup. This was despite the code having some branch divergence in the first warp, plus a synchronisation primitive.

    However, this shouldn't be considered a universal rule, and the only way to be sure is to write the code and benchmark it on the target GPU; a minimal timing harness is sketched below.

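    The following self-contained harness (a sketch, not from the original answer; the grid size, block size, iteration count and dummy output are illustrative assumptions) times both variants with CUDA events:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelRegister(unsigned long long int *out)
{
    //unique block index computed per thread, kept in a register
    const unsigned long long int blockId = blockIdx.x
        + blockIdx.y * gridDim.x
        + gridDim.x * gridDim.y * blockIdx.z;
    if(threadIdx.x == 0)
        out[blockId] = blockId; //touch memory so the result is not optimised away
}

__global__ void kernelShared(unsigned long long int *out)
{
    //unique block index computed once, stored in shared memory
    __shared__ unsigned long long int blockId_s;
    if(threadIdx.x == 0)
        blockId_s = blockIdx.x
            + blockIdx.y * gridDim.x
            + gridDim.x * gridDim.y * blockIdx.z;
    __syncthreads();
    if(threadIdx.x == 0)
        out[blockId_s] = blockId_s;
}

int main()
{
    const dim3 grid(64, 64, 4); //assumed 3D grid
    const dim3 block(256);
    const size_t nBlocks = 64ULL * 64ULL * 4ULL;
    const int iterations = 100;

    unsigned long long int *d_out;
    cudaMalloc(&d_out, nBlocks * sizeof(unsigned long long int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    for(int i = 0; i < iterations; ++i)
        kernelRegister<<<grid, block>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("register variant: %.3f ms\n", ms);

    cudaEventRecord(start);
    for(int i = 0; i < iterations; ++i)
        kernelShared<<<grid, block>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("shared variant:   %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}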