Tags: configuration, cuda, gpu-shared-memory

How do I appropriately size and launch a CUDA grid?


First question:

Suppose I need to launch a kernel with 229080 threads on a Tesla C1060, which has compute capability 1.3.

So according to the documentation, this machine has 240 cores, with 8 cores on each streaming multiprocessor (SM), for a total of 30 SMs.

I can use up to 1024 threads per SM, for a total of 30720 threads running "concurrently".

Now if I define blocks of 256 threads, that means I can have 4 blocks on each SM, because 1024/256 = 4. So those 30720 threads can be arranged into 120 blocks across all SMs.

Now for my example of 229080 threads, I would need 229080/256 ≈ 895 blocks (rounded up) to process all the threads.

Now let's say I want to call a kernel and I must use those 229080 threads, so I have two options. The first is to divide the problem so that I call the kernel ~8 times in a for loop (229080/30720, rounded up), with a grid of 120 blocks and 30720 threads each time; that way I make sure the device stays completely occupied. The other option is to call the kernel once with a grid of 895 blocks for the entire 229080 threads, in which case many blocks will wait idle until an SM finishes with the blocks it currently has resident.
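In code, the two options might look something like the sketch below (myKernel, d_data, and the offset parameter are placeholder names for illustration, not from any real API):

    const int N = 229080;
    const int threadsPerBlock = 256;

    // Option 1: loop, launching 120 blocks (30720 threads) per call;
    // the kernel must still bounds-check offset + its own index against N
    for (int offset = 0; offset < N; offset += 120 * threadsPerBlock)
        myKernel<<<120, threadsPerBlock>>>(d_data, offset, N);

    // Option 2: one launch with enough blocks for all threads (ceiling division)
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // = 895
    myKernel<<<blocks, threadsPerBlock>>>(d_data, 0, N);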

So which is the preferred option? Does it make any difference for those blocks to remain idle waiting? Do they take resources?

Second question:

Let's say that within the kernel I'm calling, I need to access non-coalesced global memory, so one option is to use shared memory.

I can then use each thread to extract a value from an array in global memory, say global_array, which is of length 229080. Now, if I understand correctly, you have to avoid branching when copying to shared memory, since all threads in a block need to reach the __syncthreads() call to make sure they can all access the shared memory.

The problem here is that for the 229080 threads I need exactly 229080/256 = 894.84375 blocks, because there is a remainder of 216 threads. I can round that number up to 895 blocks, and the last block will only use 216 threads.

But since I need to load the values into shared memory from global_array, which is of length 229080, and I can't use a conditional statement to prevent the last 40 threads (256 - 216) from accessing illegal addresses in global_array, how can I get around this problem while loading shared memory?


Solution

  • So which is the preferred option? Does it make any difference for those blocks to remain idle waiting? Do they take resources?

    A single kernel launch is preferred, given what you describe. Threadblocks that are queued up but not yet assigned to an SM don't take any resources you need to worry about, and the machine is definitely designed to handle situations just like that. All other things being equal, the overhead of 8 separate kernel calls will make that approach slower.
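    If it helps to confirm that the machine is built for this, the grid-size limit is queryable at runtime, and 895 blocks is far below it on this device. A minimal sketch using the standard cudaGetDeviceProperties call:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("max grid x-dimension: %d\n", prop.maxGridSize[0]);   // 65535 on cc 1.3
    printf("SM count: %d\n", prop.multiProcessorCount);          // 30 on a C1060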

    Now, if I understand correctly, you have to avoid branching when copying to shared memory, since all threads in a block need to reach the __syncthreads() call to make sure they can all access the shared memory.

    This statement is not correct on the face of it. You can have branching while copying to shared memory. You just need to make sure that either:

    1. The __syncthreads() is outside the branching construct, or,
    2. The __syncthreads() is reached by all threads within the branching construct (which effectively means that the branch construct evaluates to the same path for all threads in the block, at least at the point where the __syncthreads() barrier is.)

    Note that option 1 above is usually achievable; it keeps the code simple to follow and makes it easy to verify that all threads can reach the barrier.
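    Option 1 is exactly the pattern shown in the snippet at the end of this answer. For completeness, here is a sketch of option 2, where full_blocks is a placeholder for a count of blocks known to be entirely in range:

    // The condition depends only on blockIdx.x, so every thread in a given
    // block evaluates it the same way: the whole block either reaches this
    // __syncthreads() or skips it together, which is legal.
    if (blockIdx.x < full_blocks) {
        shared[threadIdx.x] = global[idx];
        __syncthreads();
    }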

    But since I need to load the values into shared memory from global_array, which is of length 229080, and I can't use a conditional statement to prevent the last 40 threads (256 - 216) from accessing illegal addresses in global_array, how can I get around this problem while loading shared memory?

    Do something like this:

    int idx = threadIdx.x + (blockDim.x * blockIdx.x);  // global index of this thread
    if (idx < data_size)                  // out-of-range threads skip the load...
      shared[threadIdx.x] = global[idx];
    __syncthreads();                      // ...but every thread reaches the barrier
    

    This is perfectly legal. All threads in the block, whether they are participating in the data copy to shared memory or not, will reach the barrier.
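
    Put together as a self-contained sketch (the kernel name, the block-reversal payload, and all buffer names are illustrative, not from the question), the whole pattern might look like this:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stage a tile of the input in shared memory, then let each thread read a
    // *different* element than the one it loaded (here: reverse within the block).
    __global__ void reverse_in_block(const float *global_in, float *global_out,
                                     int data_size)
    {
        __shared__ float tile[256];              // block size fixed at 256 below
        int idx = threadIdx.x + blockDim.x * blockIdx.x;

        if (idx < data_size)                     // out-of-range threads skip the load...
            tile[threadIdx.x] = global_in[idx];
        __syncthreads();                         // ...but all of them reach the barrier

        // Only write when both this thread's slot and its partner's source exist.
        int src = blockDim.x - 1 - threadIdx.x;  // partner index within the block
        int src_global = blockDim.x * blockIdx.x + src;
        if (idx < data_size && src_global < data_size)
            global_out[idx] = tile[src];
    }

    int main()
    {
        const int N = 229080, threads = 256;
        const int blocks = (N + threads - 1) / threads;   // = 895
        float *d_in, *d_out;
        cudaMalloc(&d_in, N * sizeof(float));
        cudaMalloc(&d_out, N * sizeof(float));
        reverse_in_block<<<blocks, threads>>>(d_in, d_out, N);
        cudaDeviceSynchronize();
        printf("launched %d blocks of %d threads\n", blocks, threads);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

    The last block has only 216 valid elements, so its first 40 threads (whose reversal partners would fall past the end of the array) simply skip the write while still participating in the barrier.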