On the host side, I can save the amount of dynamic shared memory I intend to launch a kernel with, and use it. I can even pass that as an argument to the kernel. But - is there a way to get it directly from device code, without help from the host side? That is, have the code for a kernel determine, as it runs, how much dynamic shared memory it has available?
Yes, there's a special register holding that value, named %dynamic_smem_size
. You can obtain this register's value in your CUDA C/C++ code by wrapping some inline PTX with a getter function:
__device__ unsigned dynamic_smem_size()
{
unsigned ret;
asm volatile ("mov.u32 %0, %dynamic_smem_size;" : "=r"(ret));
return ret;
}
You can similarly obtain the total size of allocated shared memory (static + dynamic) from the register %total_smem_size
.
Note: reading a special register does cost a few cycles. If you can compute this value apriori, you'll save that read, plus you might squeeze something out of the optimzier knowing that value.