Cuda/PyCuda - Large matrix traversal and block/grid size

I am working on something that has highlighted the fact that I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix that I would like to traverse, filling in each element with a value. The kernel looks like this:

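__global__ void myfun(float *vals, float *out, int M, int N)
{
    // Global 2D coordinates of this thread: y indexes rows, x indexes columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Row-major linear index into the M x N matrix
    int index = row * N + col;

    // Threads that fall outside the matrix do nothing
    if ((row < M) && (col < N)) {
        out[index] = index;
    }
}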

where M = 1000 and N = 10. I don't know how to slice this up so that every element of the matrix is covered. I need coverage for 1000*10 = 10,000 elements and, given the limit on the number of threads per block, I can't use a block size of (10,1000,1). Using PyCUDA, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?

Solution

Fix the block size, and keep the grid size dynamic. In this way, the kernel will cover each element of the matrix no matter what the values of M and N are.

block = (8, 8, 1)
grid = ((N + 7) // 8, (M + 7) // 8)   # integer division, rounded up

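Launch the kernel with this grid and block configuration. Because the grid size is rounded up, some threads may fall outside the matrix; the bounds check in the kernel makes them do nothing. You may change the block size if desired, as long as it stays within the limits of the device.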
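To see why this configuration covers the whole matrix, here is a minimal pure-Python sketch that computes the grid size and then replays the kernel's index arithmetic on the CPU. The helper names launch_dims and simulate_kernel are hypothetical and are not part of PyCUDA. Nothing here runs on the GPU; it only checks the indexing.

```python
def launch_dims(M, N, block=(8, 8, 1)):
    """Return (block, grid) covering an M x N row-major matrix.

    Hypothetical helper: x indexes columns (N), y indexes rows (M),
    matching the kernel in the question.
    """
    bx, by, _ = block
    grid = ((N + bx - 1) // bx, (M + by - 1) // by)  # ceiling division
    return block, grid


def simulate_kernel(M, N, block, grid):
    """CPU emulation of myfun's index arithmetic (illustration only)."""
    out = [None] * (M * N)
    for block_y in range(grid[1]):
        for block_x in range(grid[0]):
            for thread_y in range(block[1]):
                for thread_x in range(block[0]):
                    row = block_y * block[1] + thread_y
                    col = block_x * block[0] + thread_x
                    if row < M and col < N:  # same bounds guard as the kernel
                        out[row * N + col] = row * N + col
    return out


M, N = 1000, 10
block, grid = launch_dims(M, N)
out = simulate_kernel(M, N, block, grid)
print(block, grid)                      # (8, 8, 1) (2, 125)
print(all(v is not None for v in out))  # True: every element was written
```

With PyCUDA, these two tuples would be passed as the block= and grid= keyword arguments when calling the compiled kernel function.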