
Only half of the shared memory array is assigned


When I use Nsight and step past s_f[sidx] = 5;, I see that only half of the shared memory array has been assigned:

__global__ void BackProjectPixel(double* val,   
                                    double* projection,
                                    double* focalPtPos,
                                    double* pxlPos,
                                    double* pxlGrid,
                                    double* detPos, 
                                    double *detGridPos,
                                    unsigned int nN,
                                    unsigned int nS,
                                    double perModDetAngle,
                                    double perModSpaceAngle,
                                    double perModAngle)                 
{
    const double fx = focalPtPos[0];
    const double fy = focalPtPos[1];
    
    //extern __shared__ double s_f[]; // dynamic alternative (extern shared arrays must be unsized; size is set at launch)

    __shared__ double s_f[64];

    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int idx = j*nN + i;

    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;

    unsigned int threadsPerSharedMem = 64;

    if (sidx < threadsPerSharedMem)
    {
        s_f[sidx] = 5;
    }

    __syncthreads();

    //double * angle;

    if (sidx < threadsPerSharedMem)
    {
        s_f[idx] = TriPointAngle(detGridPos[0], detGridPos[1], fx, fy, pxlPos[idx*2], pxlPos[idx*2+1], nN);
    }
}

Here is what I observed:

(Nsight screenshot: a table of s_f values, only 32 of which are 5)

I am wondering why there are only thirty-two 5s. Shouldn't there be sixty-four 5s in s_f? Thanks.


Solution

  • I am wondering why there are only thirty-two 5s?

    There are 32 fives because, as mete says, the kernel is executed simultaneously only by groups of 32 threads, so-called warps in CUDA terminology.

    Shouldn't there be sixty-four 5s in s_f?

    There will be 64 fives after the synchronization barrier, i.e. after __syncthreads(). So if you place your breakpoint on the first instruction after the __syncthreads() call, you'll see all 64 fives. That's because by that time all the warps of the block will have finished executing all the code prior to __syncthreads().
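The barrier behavior described above can be reproduced with a minimal, self-contained kernel; this is a sketch (the kernel name FillShared and the 8×8 block shape are illustrative, not from the original code), with the two breakpoint positions marked:

```
#include <cstdio>

__global__ void FillShared(int *out)
{
    __shared__ int s_f[64];

    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
    s_f[sidx] = 5;
    // Breakpoint here may show only one warp's 32 stores completed,
    // because warps execute independently until they hit a barrier.

    __syncthreads();
    // Breakpoint here: all 64 elements are 5, since the barrier
    // guarantees every warp in the block finished the store above.

    out[sidx] = s_f[sidx];
}

int main()
{
    int *d_out, h_out[64];
    cudaMalloc(&d_out, 64 * sizeof(int));
    FillShared<<<1, dim3(8, 8)>>>(d_out);   // one block of 64 threads = 2 warps
    cudaMemcpy(h_out, d_out, 64 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 64; ++i)
        if (h_out[i] != 5) { printf("mismatch at %d\n", i); return 1; }
    printf("all 64 elements are 5\n");
    cudaFree(d_out);
    return 0;
}
```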

    How can I see all warps with Nsight?

    You can easily see the values for all the threads by putting this into the watch field:

    s_f[sidx]
    

    However, the value of sidx may become undefined due to compiler optimizations, so it is safer to watch the value of:

    s_f[((blockIdx.y * blockDim.y) + threadIdx.y) * nN + (blockIdx.x * blockDim.x) + threadIdx.x]
    

    And indeed, if you want to inspect the values for a particular warp, then, as Robert Crovella points out, you should use conditional breakpoints. If you want to break within the second warp, something like this should work in the case of a two-dimensional grid of two-dimensional blocks (which I presume you are using):

    ((blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x) == 32
    

    Because 32 is the index of the first thread within the second warp. For other combinations of block and grid dimensions, see this useful cheatsheet.