
How can I calculate blocks per grid?


Suppose I have a GPU that allows at most MAX_THREAD threads per block.

Also, suppose it allows at most MAX_BLOCK_DIM blocks per grid along each of the grid dimensions x, y, and z.

If MAX_THREAD = 1024 and dim3 threads_per_block is set to [32, 8, 4] (since 32*8*4 = 1024), how can I calculate each dimension of dim3 blocks_per_grid so that I can launch a kernel like the following?

my_kernel<<<blocks_per_grid, threads_per_block>>>(... ... ...);

For example,

dim3 threads_per_block(x, y, z);
dim3 blocks_per_grid(xx, yy, zz);

Can I calculate the values of xx, yy, and zz from x, y, and z, respectively?

If not, what is the proper way to do this?


Solution

  • You start by defining your grid dimensions. This depends on what you are doing. Let's say you have an image with dimensions 1024x768 and you do a pixel-wise computation. Then your grid would naturally be dim3(1024, 768, 1). If you have a couple of frames from a video, it might be dim3(width, height, frames).

    Your block dimensions are dictated by your kernel. If you have no specific need, I advise against using very large blocks such as 1024 threads. This can lead to inefficient utilization because synchronization points such as __syncthreads() or the end of a kernel result in comparatively many threads waiting for a few stragglers. When in doubt, stick with something small such as 128x1x1 or 16x16x1.

    Given this, you can calculate the blocks per grid:

    dim3 grid(1024, 768, 1);   // problem size, e.g. one element per pixel
    dim3 blockdim(16, 16, 1);  // threads per block
    dim3 blocks((grid.x + blockdim.x - 1) / blockdim.x,  // division rounding up
                (grid.y + blockdim.y - 1) / blockdim.y,
                (grid.z + blockdim.z - 1) / blockdim.z);
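
    With these numbers, this yields dim3 blocks(64, 48, 1), i.e. 64*48 = 3072 blocks of 16*16 = 256 threads each.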
    

    This computation is a division rounding up. It ensures that enough blocks are launched even when the grid dimensions are not divisible by the block dimensions. The downside is that you may now launch more threads than needed: effectively, your grid gets padded to a multiple of the block dimensions. There are multiple ways to cope with this, such as ensuring that array dimensions are always a multiple of 16. The simplest approach, however, is to check for out-of-range indices inside the kernel.

    __global__ void kernel(int xdim, int ydim, int zdim)
    {
        // global index of this thread in each dimension
        int x_idx = blockIdx.x * blockDim.x + threadIdx.x;
        int y_idx = blockIdx.y * blockDim.y + threadIdx.y;
        int z_idx = blockIdx.z * blockDim.z + threadIdx.z;
        // discard the padding threads that fall outside the problem domain
        if(x_idx < xdim && y_idx < ydim && z_idx < zdim) {
            do_something();
        }
    }
    

    Normally you can skip the checks in the Y and Z dimensions if the block dimensions along those axes are 1: the rounded-up division then launches exactly one block per element, so no padding occurs.
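
    Putting the pieces together, a host-side launch could look like the following minimal sketch (assuming the bounds-checked kernel above is in scope and do_something() is defined elsewhere; the problem size is made up for illustration):

    #include <cuda_runtime.h>

    int main()
    {
        // hypothetical problem size, e.g. one element per pixel of a 1024x768 image
        int xdim = 1024, ydim = 768, zdim = 1;

        dim3 blockdim(16, 16, 1);
        dim3 blocks((xdim + blockdim.x - 1) / blockdim.x,  // division rounding up
                    (ydim + blockdim.y - 1) / blockdim.y,
                    (zdim + blockdim.z - 1) / blockdim.z);

        kernel<<<blocks, blockdim>>>(xdim, ydim, zdim);
        cudaDeviceSynchronize();  // wait for the kernel and surface any errors
        return 0;
    }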

    Another approach is to start only as many threads as can run concurrently on the GPU and to loop over the data inside the kernel (a grid-stride loop). For large grids, this reduces the share of the launch overhead in the total work: the index computations only need to be done once, and further optimizations become possible by pulling parts of the computation out of the loop body.

    __global__ void kernel(int xdim, int ydim, int zdim)
    {
        // starting position of this thread in each dimension
        int x_start = blockIdx.x * blockDim.x + threadIdx.x;
        int y_start = blockIdx.y * blockDim.y + threadIdx.y;
        int z_start = blockIdx.z * blockDim.z + threadIdx.z;
        // stride = total number of threads launched per dimension
        int x_stride = gridDim.x * blockDim.x;
        int y_stride = gridDim.y * blockDim.y;
        int z_stride = gridDim.z * blockDim.z;
        for(int z_idx = z_start; z_idx < zdim; z_idx += z_stride)
            for(int y_idx = y_start; y_idx < ydim; y_idx += y_stride)
                for(int x_idx = x_start; x_idx < xdim; x_idx += x_stride)
                    do_something();
    }
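
    To size such a launch, one option is the CUDA occupancy API. Below is a minimal sketch (the variable names and the block size of 128 are illustrative) that launches roughly as many blocks as the device can run concurrently. A 1D launch is shown for simplicity; the kernel above still covers every element because the Y and Z strides then degenerate to 1:

    #include <cuda_runtime.h>

    int main()
    {
        // hypothetical problem size, as in the examples above
        int xdim = 1024, ydim = 768, zdim = 1;
        int blockSize = 128;

        // how many blocks of this kernel can be resident on one multiprocessor?
        int blocksPerSm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, kernel, blockSize, 0);

        // enough blocks to fill every multiprocessor on the device once
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int numBlocks = blocksPerSm * prop.multiProcessorCount;

        kernel<<<numBlocks, blockSize>>>(xdim, ydim, zdim);
        cudaDeviceSynchronize();
        return 0;
    }

    Launching more or fewer blocks than this is harmless for correctness; the grid-stride loops simply partition the work differently.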