I am using a GTX9800, which reports 16384 bytes of available shared memory.
Given the following kernel code, run with T = int (4 bytes):
template <typename T>
__global__ void foo(unsigned n, T *x)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ T sx[4096];
    x[idx] = 0;
}
I get the expected result: the array x, which is initially non-zero, is filled with zeros.
However, suppose I add a line of code that does not do anything:
template <typename T>
__global__ void foo(unsigned n, T *x)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ T sx[4096];
    sx[0] = 0;
    x[idx] = 0;
}
Now x does not contain any zeros at all after calling the kernel!
However, if I change the size of sx to be <= 4088, I get the expected result again.
What's going on? I am rather confused.
Edit: corrected typo: changed 16384 "KB" to "bytes"
The size of the shared memory on compute capability 1.x devices is 16384 bytes per SM, not kilobytes.
Furthermore, every block consumes 16 bytes of that shared memory for internal purposes (storing block indices etc.), plus additional storage for the kernel's arguments.
So unfortunately you cannot use the full 16 KB of shared memory in a single block.
For higher compute capabilities this data will be stored elsewhere (constant memory and special registers), so the entire shared memory is available there.
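The arithmetic behind the 4088 limit, and a way to see the failure instead of silently getting unchanged data, can be sketched roughly as follows. This is only an illustration under stated assumptions: the 16 bytes of kernel arguments assume a 64-bit build (a 4-byte unsigned, 4 bytes of padding, and an 8-byte pointer), and max_shared_elems and launch_and_check are hypothetical helper names, not part of the original code. The first kernel presumably works because sx is never used there, so the compiler can drop the array entirely.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Figures for compute capability 1.x as given in the answer above:
// 16384 bytes of shared memory per SM, 16 of them reserved per block.
constexpr std::size_t kSharedBytes   = 16384;
constexpr std::size_t kReservedBytes = 16;

// Hypothetical helper: largest element count that still fits once the
// reserved bytes and the kernel arguments (arg_bytes) are subtracted.
constexpr std::size_t max_shared_elems(std::size_t elem_bytes, std::size_t arg_bytes)
{
    return (kSharedBytes - kReservedBytes - arg_bytes) / elem_bytes;
}

// Assumption: 64-bit build, so (unsigned n, int *x) occupies 16 bytes.
// (16384 - 16 - 16) / 4 = 4088, matching the limit seen in the question.
static_assert(max_shared_elems(sizeof(int), 16) == 4088, "budget for T = int");

template <typename T>
__global__ void foo(unsigned n, T *x)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ T sx[4096];     // 16384 bytes for T = int: the whole 1.x budget
    sx[0] = 0;                 // using sx forces the array to be allocated
    if (idx < n) x[idx] = 0;   // bounds guard added for this sketch
}

// Hypothetical helper: launches foo<int> over n elements of d_x and reports
// any launch or execution error. Returns true if no error was reported.
bool launch_and_check(int *d_x, unsigned n)
{
    const unsigned threads = 256;
    const unsigned blocks  = (n + threads - 1) / threads;
    foo<int><<<blocks, threads>>>(n, d_x);

    cudaError_t err = cudaGetLastError();   // errors detected at launch time
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised while the kernel runs
    if (err != cudaSuccess) {
        std::fprintf(stderr, "foo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Calling launch_and_check(d_x, n) from main after a cudaMalloc should make a rejected launch visible as an error message rather than as an array that was never written. On devices with more shared memory per block, the same kernel is expected to launch without complaint.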