CUDA shared memory shown as registers in Nsight


I declared shared memory and tried to trace it with Nsight 2.2 for Visual Studio 2010. I'm using CUDA 4.2 with a Quadro 5000.

In my kernel.cu:

extern __shared__ ushort2 sampleGatheringSM[];

The code launching the kernel:

sampleGathering_SM_size = dimBlock.x * dimBlock.y * 4 * sizeof(ushort2) * 2; // = 10240
sampleGatheringKernel<<<dimGrid, dimBlock, sampleGathering_SM_size>>>(dev_image, dev_gradient, width, height);

When I run an analysis activity in Nsight and open the "CUDA Launches" report, it tells me that:

  • Allocated Registers per block: 10240
  • Allocated Shared Memory per block: 0
  • Block Limit Reason: Registers

Did I allocate shared memory correctly? I don't understand how I could allocate registers.

EDIT:

It also tells me:

  • Registers per thread: 32
  • Dynamic Shared memory per block: 0
  • Static shared memory per block: 0

Solution

  • The declaration of dynamic shared memory is correct. The Nsight 2.2 Analysis Trace Report has a bug that affects only CUDA Trace Activities. Analysis Trace Activities run with the option Nsight | Options | Analysis | CUDA Kernel Trace Mode = Serialized, and Analysis Profiler CUDA Activities, both display the correct value. This bug will be fixed in the next version of Nsight.
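
If you want to cross-check the numbers without relying on the Nsight report, something like the sketch below may help. It is not the original kernel: the kernel body, the `sharedBytesPerBlock` helper, and the 32 x 10 block shape are all assumptions made for illustration (320 threads is simply the block size implied by the 10240-byte figure, since `sizeof(ushort2)` is 4). It launches with dynamic shared memory and then asks the runtime, via `cudaFuncGetAttributes`, for the per-thread register count and the static shared memory size. Dynamic shared memory is not part of those attributes; it is whatever you pass as the third launch parameter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dynamic shared memory: its size comes from the third launch parameter.
extern __shared__ ushort2 sampleGatheringSM[];

// Hypothetical stand-in for the real kernel; it only touches shared memory.
__global__ void sampleGatheringKernel(ushort2 *out)
{
    unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;
    sampleGatheringSM[tid] = make_ushort2((unsigned short)tid, 0);
    __syncthreads();
    if (tid == 0)
        out[blockIdx.x] = sampleGatheringSM[0];
}

// Same formula as in the question: 4 * 2 ushort2 entries (4 bytes each) per thread.
size_t sharedBytesPerBlock(dim3 block)
{
    return (size_t)block.x * block.y * 4 * sizeof(ushort2) * 2;
}

int main()
{
    dim3 dimGrid(1), dimBlock(32, 10);  // assumed shape: 320 threads per block
    size_t smBytes = sharedBytesPerBlock(dimBlock);

    ushort2 *dev_out = 0;
    cudaMalloc((void **)&dev_out, sizeof(ushort2));
    sampleGatheringKernel<<<dimGrid, dimBlock, smBytes>>>(dev_out);
    cudaError_t err = cudaDeviceSynchronize();

    // Static properties of the compiled kernel, as seen by the runtime.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, sampleGatheringKernel);

    unsigned int threads = dimBlock.x * dimBlock.y;
    printf("dynamic shared memory requested: %lu bytes\n", (unsigned long)smBytes);
    printf("static shared memory:            %lu bytes\n", (unsigned long)attr.sharedSizeBytes);
    printf("registers per thread:            %d\n", attr.numRegs);
    // Lower bound only: the hardware may round register allocation up.
    printf("registers per block (at least):  %u\n", attr.numRegs * threads);

    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}
```

One more thing worth noting: if the block in the question really has 320 threads, then 32 registers per thread times 320 threads is 10240, so the "Allocated Registers per block" figure may well be genuine and only coincidentally equal to the shared memory size in bytes. The zero shown for shared memory would then be the only wrong value in the report.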