
How to parallelize evaluation of a function to each element of a matrix in CUDA


I am trying to reduce each element of a matrix modulo p, and I am wondering whether I have set up the thread indexing for a two-dimensional grid-stride loop correctly.

Here is my kernel:

__global__
void gpu_matrix_fma_reduction(double *matrix, int rows, int cols, double u, double p){
  /* Reduce each coefficient of a matrix modulo p (with u = 1.0/p) */
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockIdx.x*gridDim.x) {
    for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockIdx.y*gridDim.y) {
      gpu_fma_reduction(matrix + i*cols + j, u, p);
    }
  }
}
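
gpu_fma_reduction reduces a single coefficient modulo p using the precomputed reciprocal u = 1.0/p. Its body is omitted here; a minimal sketch of what such a helper might look like, assuming a floor-plus-FMA reduction (illustrative, not the actual code), is:

__device__
void gpu_fma_reduction(double *x, double u, double p){
  /* Illustrative sketch: bring *x into [0, p) using u = 1.0/p */
  double q = floor(*x * u); /* approximate quotient x / p */
  *x = fma(-q, p, *x);      /* x - q*p with a single fused multiply-add */
}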

Here is the call to the kernel:

dim3 threadsPerBlock(16, 16);
dim3 numBlocks( n*m / threadsPerBlock.x, n*m / threadsPerBlock.y);
gpu_matrix_fma_reduction<<<numBlocks, threadsPerBlock>>>(partial_matrix, n, m, u, p);

I get an infinite loop. I am not sure yet whether it is due to this kernel.

EDIT: replaced rows with cols in the function call.


Solution

  • The answer has been given in the comments:

    __global__
    void gpu_matrix_fma_reduction(double *matrix, int rows, int cols, double u, double p){
      /* Reduce each coefficient of a matrix modulo p (with u = 1.0/p) */
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x*gridDim.x) {
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y*gridDim.y) {
          gpu_fma_reduction(matrix + i*cols + j, u, p);
        }
      }
    }
    

    blockIdx.x * gridDim.x is not a correct increment; the stride of a grid-stride loop is blockDim.x * gridDim.x (and likewise blockDim.y * gridDim.y in the second dimension). This is also what causes the hang: every thread in a block with blockIdx.x == 0 computes an increment of 0, so its loop never advances.

    Even with the stride fixed, this kernel is slow, and a two-dimensional loop is not the best traversal pattern for this problem: consecutive threads along x index consecutive rows, so their accesses land cols elements apart in memory. Since the same function is applied independently to every element, the matrix can instead be traversed as a flat one-dimensional vector. Consecutive threads then touch consecutive addresses, matching the storage order, so each warp's loads and stores coalesce:

    __global__
    void gpu_matrix_fma_reduction(double *matrix, int rows, int cols, double u, double p){
      /* Reduce each coefficient of a matrix modulo p (with u = 1.0/p) */
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows*cols; i += blockDim.x*gridDim.x) {
        gpu_fma_reduction(matrix + i, u, p);
      }
    }
    

    The kernel call must be adapted accordingly:

    int threadsPerBlock = 256;
    int numBlocks = (n*m + threadsPerBlock - 1) / threadsPerBlock; /* round up so at least one block is launched even when n*m < 256 */
    gpu_matrix_fma_reduction<<<numBlocks, threadsPerBlock>>>(partial_matrix, n, m, u, p);
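
    For completeness, here is a minimal host-side sketch of how this kernel could be driven end to end. The dimensions, modulus, and allocation below are illustrative assumptions, not taken from the question:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main(){
      const int n = 1024, m = 512;        /* example dimensions (assumed) */
      const double p = 7.0, u = 1.0 / p;  /* modulus and its precomputed reciprocal */

      double *partial_matrix;
      cudaMalloc(&partial_matrix, (size_t)n * m * sizeof(double));
      /* ... fill partial_matrix, e.g. with cudaMemcpy from host data ... */

      int threadsPerBlock = 256;
      int numBlocks = (n*m + threadsPerBlock - 1) / threadsPerBlock;
      gpu_matrix_fma_reduction<<<numBlocks, threadsPerBlock>>>(partial_matrix, n, m, u, p);

      /* Synchronize to surface any launch or execution error */
      cudaError_t err = cudaDeviceSynchronize();
      if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

      cudaFree(partial_matrix);
      return 0;
    }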