
How can I launch a kernel with "as much dynamic shared mem as is possible"?


We know CUDA devices have very limited shared memory capacity, in the tens of kilobytes only. We also know that kernels won't launch (typically? ever?) if you ask for too much shared memory. And we also know that the available shared memory is split between the static allocations declared in your kernel code and the dynamically-allocated shared memory requested at launch.

Now, cudaGetDeviceProperties() gives us the overall space we have. But, given a function symbol, is it possible to determine how much statically-allocated shared memory it would use, so that I can "fill up" the shared memory to full capacity at launch? If not, can CUDA somehow take care of this for me?
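
For context, this is roughly how I'm reading the total per-block capacity (a minimal sketch; assumes device 0 and omits error checking):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0, for example
        std::printf("shared memory per block: %zu bytes\n",
                    prop.sharedMemPerBlock);
        return 0;
    }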


Solution

  • The runtime API has a function cudaFuncGetAttributes which will allow you to retrieve the attributes of any kernel in the current context, including the amount of static shared memory per block which the kernel will consume (the sharedSizeBytes field of the returned cudaFuncAttributes structure). You can do the math yourself with that information: subtract it from the device's sharedMemPerBlock to get the amount of dynamic shared memory you can still request at launch, as sketched below.
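  • For example, a minimal sketch putting the two calls together (the kernel, buffer sizes, and names here are made up for illustration, and error checking is omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel mixing a static shared-memory array with a
    // dynamically-sized extern allocation.
    __global__ void my_kernel(float *out)
    {
        __shared__ float static_buf[256];       // static shared memory
        extern __shared__ float dynamic_buf[];  // dynamic shared memory
        unsigned i = threadIdx.x;
        static_buf[i % 256] = (float)i;
        __syncthreads();
        dynamic_buf[i] = static_buf[i % 256];
        out[i] = dynamic_buf[i];
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, my_kernel);

        // Whatever the kernel doesn't consume statically is available
        // for the dynamic allocation.
        size_t dyn_smem = prop.sharedMemPerBlock - attr.sharedSizeBytes;
        std::printf("static: %zu bytes, dynamic available: %zu bytes\n",
                    attr.sharedSizeBytes, dyn_smem);

        float *out;
        cudaMalloc(&out, 128 * sizeof(float));
        // The third launch parameter is the dynamic shared memory size
        // in bytes.
        my_kernel<<<1, 128, dyn_smem>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }

    One caveat: this stays within the default per-block limit reported by sharedMemPerBlock. On devices that support a larger opt-in capacity, requesting more dynamic shared memory than the default requires an explicit cudaFuncSetAttribute call with cudaFuncAttributeMaxDynamicSharedMemoryBytes first.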