I am trying to reduce each element of a matrix modulo p, and I am wondering whether I got the thread-index arithmetic for a two-dimensional grid-stride loop right.
Here is my kernel:
__global__
void gpu_matrix_fma_reduction(double *matrix, int rows, int cols, double u, double p) {
    /* Reduce each coefficient of a matrix modulo p (with u = 1.0/p) */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockIdx.x * gridDim.x) {
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockIdx.y * gridDim.y) {
            gpu_fma_reduction(matrix + i * cols + j, u, p);
        }
    }
}
Here is the call to the kernel:
dim3 threadsPerBlock(16, 16);
dim3 numBlocks( n*m / threadsPerBlock.x, n*m / threadsPerBlock.y);
gpu_matrix_fma_reduction<<<numBlocks, threadsPerBlock>>>(partial_matrix, n, m, u, p);
I get an infinite loop. I am not sure yet whether it is due to this kernel.
EDIT: replaced rows by cols in the function call.
The answer has been given in the comments:
__global__
void gpu_matrix_fma_reduction(double *matrix, int rows, int cols, double u, double p) {
    /* Reduce each coefficient of a matrix modulo p (with u = 1.0/p) */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows; i += blockDim.x * gridDim.x) {
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < cols; j += blockDim.y * gridDim.y) {
            gpu_fma_reduction(matrix + i * cols + j, u, p);
        }
    }
}
The increments blockIdx.x * gridDim.x and blockIdx.y * gridDim.y are not correct. The stride must be the total number of threads along each dimension, blockDim.x * gridDim.x (and blockDim.y * gridDim.y for the inner loop). With the original increment, every block with blockIdx.x == 0 gets a stride of 0, so its threads never advance: that is the infinite loop.
Now, this is still terribly slow, and a two-dimensional loop is not the best access pattern for this problem. Since we only apply one function to each element, we can traverse the matrix as a one-dimensional vector. This avoids jumps in index arithmetic, and memory is handled better when it is accessed in storage order:
__global__
void gpu_matrix_fma_reduction(double *matrix, int rows, int cols, double u, double p) {
    /* Reduce each coefficient of a matrix modulo p (with u = 1.0/p) */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows * cols; i += blockDim.x * gridDim.x) {
        gpu_fma_reduction(matrix + i, u, p);
    }
}
The kernel call must be adapted:
int threadsPerBlock = 256;
int numBlocks = n * m / threadsPerBlock;
gpu_matrix_fma_reduction<<<numBlocks, threadsPerBlock>>>(partial_matrix, n, m, u, p);