Tags: c++, performance, cuda, gpu-shared-memory

CUDA, could using shared memory improve my performance?


I'm implementing an algorithm to convert an image to grayscale using CUDA. I have it working, but I'm looking for ways to improve performance. Currently, the entire color image is transferred to device memory, after which each thread calculates its gray pixel value by looking up the corresponding three (r, g, b) color values.

I have already made sure that access to global memory is coalesced, though this did not really improve my performance (a 36 MB image took 0.003 s less to process after the memory accesses were coalesced...). Now I'm wondering whether using shared memory could improve my performance. Here's what I have right now:

My CUDA kernel:

__global__ void darkenImage(const unsigned char * inputImage,
    unsigned char * outputImage, const int width, const int height, int iteration){

  // Index of the red component of this thread's pixel; each pixel is
  // stored as 3 consecutive bytes (r, g, b). MAX_BLOCKS and nrThreads
  // are compile-time constants defined elsewhere.
  int x = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads))) * 3;

  if(x + 2 < (3 * width * height)){
    float r = static_cast< float >(inputImage[x]);
    float g = static_cast< float >(inputImage[x + 1]);
    float b = static_cast< float >(inputImage[x + 2]);

    // Weighted luminance (0.3 r + 0.59 g + 0.11 b), then darken.
    float grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)), __fmul_rn(0.11f, b));
    grayPix = fmaf(grayPix, 0.6f, 0.5f);

    outputImage[x / 3] = static_cast< unsigned char >(grayPix);
  }
}
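The question doesn't show the launch code, but given the iteration parameter it presumably looks something like the following (a hypothetical sketch; hostImage, hostOutput, devInput, and devOutput are assumed names, and MAX_BLOCKS and nrThreads are the constants referenced by the kernel):

#include <cuda_runtime.h>

// Hypothetical host-side driver: copy the RGB image to the device,
// then launch the kernel repeatedly with an increasing iteration
// offset until every pixel has been covered.
void convertToGray(const unsigned char * hostImage, unsigned char * hostOutput,
    int width, int height){
  unsigned char *devInput, *devOutput;
  const int numPixels = width * height;

  cudaMalloc(&devInput, 3 * numPixels);
  cudaMalloc(&devOutput, numPixels);
  cudaMemcpy(devInput, hostImage, 3 * numPixels, cudaMemcpyHostToDevice);

  // Each launch processes MAX_BLOCKS * nrThreads pixels.
  const int pixelsPerLaunch = MAX_BLOCKS * nrThreads;
  for(int iteration = 0; iteration * pixelsPerLaunch < numPixels; ++iteration){
    darkenImage<<<MAX_BLOCKS, nrThreads>>>(devInput, devOutput, width, height, iteration);
  }

  cudaMemcpy(hostOutput, devOutput, numPixels, cudaMemcpyDeviceToHost);
  cudaFree(devInput);
  cudaFree(devOutput);
}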

My question really is: because no value is shared between any two threads, using shared memory shouldn't really help here, should it? Or did I misunderstand?

Regards,

Linus


Solution

  • If you are not using the same value more than once, shared memory (a software-managed cache) will not improve performance. What you can do instead is remove the iteration parameter and process more data with each block: launch the kernel once and loop within the kernel so that each thread calculates more than one output pixel, as in the sketch below.
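A minimal sketch of that suggestion, assuming the same 3-bytes-per-pixel layout as the kernel in the question (the kernel name grayscaleImage is made up; the grid-stride loop replaces the iteration parameter, so a single launch covers the whole image regardless of grid size):

__global__ void grayscaleImage(const unsigned char * inputImage,
    unsigned char * outputImage, const int width, const int height){

  const int numPixels = width * height;
  const int stride = gridDim.x * blockDim.x; // total threads in the grid

  // Grid-stride loop: each thread handles pixels i, i + stride, i + 2*stride, ...
  for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < numPixels; i += stride){
    float r = static_cast< float >(inputImage[3 * i]);
    float g = static_cast< float >(inputImage[3 * i + 1]);
    float b = static_cast< float >(inputImage[3 * i + 2]);

    float grayPix = 0.3f * r + 0.59f * g + 0.11f * b;
    grayPix = fmaf(grayPix, 0.6f, 0.5f);

    outputImage[i] = static_cast< unsigned char >(grayPix);
  }
}

With this structure the launch configuration (e.g. <<<MAX_BLOCKS, nrThreads>>>) no longer has to match the image size, and the host-side iteration loop disappears entirely; the kernel is launched exactly once per image.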