I'm trying to process an array of big structures with CUDA 2.0 (NVIDIA 590), and I'd like to use shared memory for it. I've experimented with the CUDA occupancy calculator, trying to allocate the maximum shared memory per thread so that each thread can process a whole element of the array. However, the maximum value of (shared memory per block) / (threads per block) that I can see in the calculator at 100% multiprocessor load is 32 bytes, which is too small for a single element by about an order of magnitude. Is 32 bytes the maximum possible value for (shared memory per block) / (threads per block)? Is it possible to say which alternative is preferable: keeping part of the array in global memory, or just running with an underloaded multiprocessor? Or can that only be decided by experiment? Yet another alternative I can see is to process the array in several passes, but that looks like a last resort. This is the first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
Whichever of these limits you hit first becomes the constraint that limits your occupancy (is maximum occupancy what you mean by "100% Multiprocessor load"?). Beyond a certain occupancy threshold, pushing occupancy higher matters less. For example, an occupancy of 33% does not mean that you can only achieve 33% of the GPU's maximum theoretical performance. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference that recommends not worrying too much about occupancy and instead trying to minimize memory transactions by using explicit caching tricks (and other techniques) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
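To make the "whichever limit you hit first" idea concrete, here is a rough host-side sketch in plain C++ (no CUDA toolkit needed). It assumes the per-multiprocessor limits of a compute capability 2.0 device: 1536 resident threads, 8 resident blocks, and 48 KB of shared memory. The names `SmLimits` and `estimate_occupancy` are made up for illustration, and the estimate ignores register usage and shared memory allocation granularity, so treat it as an approximation of what the occupancy calculator reports rather than a replacement for it. Under these assumptions, 48 KB / 1536 threads = 32 bytes per thread, which would explain the 32-byte figure in the question.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits for compute capability 2.0.
// Other devices have different values.
struct SmLimits {
    int max_threads = 1536;        // resident threads per multiprocessor
    int max_blocks = 8;            // resident blocks per multiprocessor
    int shared_bytes = 48 * 1024;  // shared memory per multiprocessor (48 KB setting)
};

// Hypothetical helper: estimated occupancy in the range [0, 1].
// Only the thread, block, and shared memory limits are modeled;
// registers and allocation granularity are ignored.
double estimate_occupancy(int threads_per_block,
                          int shared_bytes_per_thread,
                          SmLimits sm = SmLimits{}) {
    assert(threads_per_block > 0 && shared_bytes_per_thread >= 0);

    // Resident blocks allowed by the thread limit.
    int by_threads = sm.max_threads / threads_per_block;

    // Resident blocks allowed by the shared memory limit.
    // With no shared memory in use, this limit does not apply.
    int by_shared = sm.max_blocks;
    if (shared_bytes_per_thread > 0) {
        int per_block = threads_per_block * shared_bytes_per_thread;
        by_shared = sm.shared_bytes / per_block;  // 0 if one block does not fit
    }

    // The tightest limit decides how many blocks are resident.
    int blocks = std::min({sm.max_blocks, by_threads, by_shared});
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

With these assumed limits, `estimate_occupancy(192, 32)` gives 1.0, while `estimate_occupancy(256, 96)` gives about 0.33: two blocks of 256 threads fit in shared memory, so 512 of 1536 thread slots are used. As noted above, that lower occupancy does not by itself mean the kernel will run at a third of the speed.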
The only real way to be sure that your kernel design gives the best performance is to test all the possibilities. You also need to redo this performance testing for each type of device you run on, because each one has different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.
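As a minimal sketch of the "test all the possibilities" approach, the host-side C++ helper below times a list of candidate launch routines and returns the index of the fastest one. This is not the templating setup described above, only an illustration: `pick_fastest` is a hypothetical name, and each candidate is assumed to be a callable that launches one kernel variant and blocks until it has finished (for a real CUDA kernel that means synchronizing with the device before returning, otherwise only the launch overhead is measured).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs every candidate `repeats` times and returns
// the index of the candidate with the smallest total wall-clock time.
std::size_t pick_fastest(const std::vector<std::function<void()>>& candidates,
                         int repeats = 3) {
    assert(!candidates.empty() && repeats > 0);
    using clock = std::chrono::steady_clock;

    std::size_t best = 0;
    clock::duration best_time = clock::duration::max();

    for (std::size_t i = 0; i < candidates.size(); ++i) {
        candidates[i]();  // warm-up run, not timed

        auto start = clock::now();
        for (int r = 0; r < repeats; ++r) {
            candidates[i]();
        }
        auto elapsed = clock::now() - start;

        if (elapsed < best_time) {
            best_time = elapsed;
            best = i;
        }
    }
    return best;
}
```

The selection would need to be rerun on each type of device, since the winning variant on one GPU is not guaranteed to win on another.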