
CUDA shared array not getting values?


I am trying to implement a simple parallel reduction using the code from the CUDA SDK. Something is wrong in my kernel, though: the shared array never receives the values from the global array and stays all zeroes.

extern __shared__ float4 sdata[];

// each thread loads one element from global to shared mem

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

sdata[tid] = dev_src[i];

__syncthreads();

// do reduction in shared mem

for(unsigned int s = 1; s < blockDim.x; s *= 2) {
    if(tid % (2*s) == 0){
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// write result for this block to global mem
if(tid == 0)
    out[blockIdx.x] = sdata[0];

Edit:

OK, I got it working by removing the extern keyword and giving the shared array a constant size such as 512. I am in good shape now, but maybe someone can explain why it was not working with the extern keyword.


Solution

  • I think I know why this is happening, as I have run into it before. How are you launching the kernel?

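    Remember that in the launch kernel<<<blocks,threads,sharedMemory>>>, sharedMemory must be the size of the dynamic shared memory in bytes, not the number of elements. So if the array has to hold 512 elements, the third parameter should be 512 * sizeof(float4). An extern __shared__ array gets its size only from that launch parameter, whereas an array declared with a constant size is sized at compile time, which is why the fixed-size version worked. I think you are launching it as below, which is wrong: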

    kernel<<<blocks,threads,512>>>   // this is wrong
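
For comparison, here is a minimal, self-contained sketch of a launch that passes the size in bytes. The names reduceKernel, sharedBytes, host_src, host_out and dev_out, the block and thread counts, and the test data are assumptions made for illustration; only the kernel body follows the question. The float4 addition is written out per component so the sketch does not depend on operator overloads from an SDK helper header.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel name; the body mirrors the reduction in the question.
__global__ void reduceKernel(const float4 *dev_src, float4 *out)
{
    // Unsized extern array: its size comes from the third launch parameter.
    extern __shared__ float4 sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element from global to shared memory.
    sdata[tid] = dev_src[i];
    __syncthreads();

    // Reduction in shared memory (assumes blockDim.x is a power of two).
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid].x += sdata[tid + s].x;
            sdata[tid].y += sdata[tid + s].y;
            sdata[tid].z += sdata[tid + s].z;
            sdata[tid].w += sdata[tid + s].w;
        }
        __syncthreads();
    }

    // Thread 0 writes this block's result to global memory.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

int main()
{
    // Assumed sizes; the input length is an exact multiple of the block size.
    const int threads = 512;
    const int blocks = 4;
    const int n = threads * blocks;

    std::vector<float4> host_src(n, make_float4(1.0f, 2.0f, 3.0f, 4.0f));
    std::vector<float4> host_out(blocks);

    float4 *dev_src = nullptr;
    float4 *dev_out = nullptr;
    cudaMalloc(&dev_src, n * sizeof(float4));
    cudaMalloc(&dev_out, blocks * sizeof(float4));
    cudaMemcpy(dev_src, host_src.data(), n * sizeof(float4),
               cudaMemcpyHostToDevice);

    // Size in BYTES: one float4 per thread in the block.
    size_t sharedBytes = threads * sizeof(float4);
    reduceKernel<<<blocks, threads, sharedBytes>>>(dev_src, dev_out);

    // Report launch or execution errors instead of silently reading bad data.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(host_out.data(), dev_out, blocks * sizeof(float4),
               cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %d: %.0f %.0f %.0f %.0f\n", b,
               host_out[b].x, host_out[b].y, host_out[b].z, host_out[b].w);

    cudaFree(dev_src);
    cudaFree(dev_out);
    return 0;
}
```

With this input each block should report 512 1024 1536 2048. Passing 512 as the third parameter would reserve only 512 bytes, which is room for 32 float4 values, so most threads in a 512-thread block would index past the end of the array and the result would be undefined.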