Cuda/PyCuda - Large matrix traversal and block/grid size

I am working on something that has highlighted the fact that I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix that I would like to traverse, filling in each element with a value. The kernel looks like this:

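__global__ void myfun(float *vals, float *out, int M, int N)
{
      // Global row/column handled by this thread
      int row = blockIdx.y*blockDim.y + threadIdx.y;
      int col = blockIdx.x*blockDim.x + threadIdx.x;
      int index = row*N + col;  // row-major offset

      // Skip threads that fall outside the M x N matrix
      if( (row < M ) && (col < N) ) {
          out[index] = index;
      }
}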

where M = 1000 and N = 10. I don't know how to slice this up so that every element in the matrix is covered. I need coverage for 1000*10 = 10,000 elements, and given the limit on the number of threads per block, I can't use a block size of (10,1000,1). Using PyCUDA, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?


  • Fix the block size, and keep the grid size dynamic. In this way, the kernel will cover each element of the matrix no matter what the values of M and N are.

    block = (8, 8, 1)
    grid = ((N + 7) // 8, (M + 7) // 8)

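    Launch the kernel with this grid and block configuration. The grid dimensions are rounded up so that partial blocks at the matrix edges are still launched; the bounds check in the kernel discards the extra threads. You may change the block size if desired, as long as it stays within the device's limits.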
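To see why a fixed block plus a rounded-up grid reaches every element, here is a minimal CPU-only sketch that emulates the kernel's index arithmetic in plain Python. It does not touch the GPU or PyCUDA; the helper names `ceil_div`, `launch_config` and `covered_elements` are hypothetical, introduced only for illustration.

```python
def ceil_div(a, b):
    """Smallest integer >= a / b, using integer arithmetic only."""
    return (a + b - 1) // b


def launch_config(M, N, bx=8, by=8):
    """Return (block, grid) tuples for an M-row by N-column matrix."""
    block = (bx, by, 1)                        # x spans columns, y spans rows
    grid = (ceil_div(N, bx), ceil_div(M, by))  # round up in both directions
    return block, grid


def covered_elements(M, N, block, grid):
    """Emulate the kernel's indexing on the CPU and collect written offsets."""
    hit = set()
    for block_y in range(grid[1]):
        for block_x in range(grid[0]):
            for thread_y in range(block[1]):
                for thread_x in range(block[0]):
                    row = block_y * block[1] + thread_y
                    col = block_x * block[0] + thread_x
                    if row < M and col < N:    # same bounds check as the kernel
                        hit.add(row * N + col)
    return hit


block, grid = launch_config(1000, 10)
covered = covered_elements(1000, 10, block, grid)
print(block, grid, len(covered))  # (8, 8, 1) (2, 125) 10000
```

For the 1000x10 case this gives a grid of (2, 125): 16,000 threads are launched, 10,000 of them write an element, and the remaining 6,000 are filtered out by the bounds check.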