I'm using CUDA for the iterative Karatsuba algorithm and I would like to ask why one line is always computed differently.
First, I implemented this function, which always computed the result correctly:
    __global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if( i > 0 && i < resultSize - 1){
            TYPE start = (i >= size) ? (i % size ) + 1 : 0;
            TYPE end = (i + 1) / 2;
            for(TYPE inner = start; inner < end; inner++){
                result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
                result[i] -= ( D[inner] + D[i-inner] );
            }
        }
    }
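For context, a 1D launch for this kernel looks something like the sketch below (the 256-thread block size is only an example; the point is that exactly one thread handles each index i):

    // example 1D launch; the block size of 256 is an arbitrary choice
    int threads = 256;
    int blocks  = (resultSize + threads - 1) / threads;   // round up so every i is covered
    kernel_res_main <<<blocks, threads>>> (devA, devB, devD, devResult, size, resultSize);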
Now I would like to use a 2D grid and let CUDA handle the for-loop, so I changed my function to this:
    __global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        int j = blockDim.y * blockIdx.y + threadIdx.y;
        TYPE rtmp = result[i];
        if( i > 0 && i < resultSize - 1){
            TYPE start = (i >= size) ? (i % size ) + 1 : 0;
            TYPE end = (i + 1) >> 1;
            if(j >= start && j <= end ){
                // WRONG
                rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
            }
        }
        result[i] = rtmp;
    }
I am calling this function like this:
    dim3 block( 32, 8 );
    dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
    kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
The result is always wrong and differs from run to run. I can't understand why the second implementation is wrong and keeps computing wrong results. I can't see any logical problem connected with data dependency there. Does anyone know how I can solve this problem?
For a question like this, you are supposed to provide a MCVE (see item 1 here). For example, I don't know what type is indicated by TYPE, and it does matter for the correctness of the solution I will propose.
In your first kernel, only one thread in your entire grid was reading and writing location result[i]. But in your second kernel, you now have multiple threads writing to the result[i] location, and they are conflicting with each other. CUDA doesn't specify the order in which threads will run, and some may run before, after, or at the same time as others. In this case, some threads may read result[i] at the same time as others. Then, when the threads write their results, they will be inconsistent, and it may vary from run to run. You have a race condition there (an execution order dependency, not a data dependency).
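To make the hazard concrete, here is the read-modify-write sequence from your second kernel, annotated with one possible interleaving of two threads that share the same i but have different j (this schedule is only an illustration; the hardware is free to order them however it likes):

    TYPE rtmp = result[i];   // thread (i, j=1) reads the old value
                             // thread (i, j=2) also reads the same old value
    rtmp += ...;             // each thread adds only its own term to its private copy
    result[i] = rtmp;        // thread (i, j=1) writes old + term(j=1)
                             // thread (i, j=2) then writes old + term(j=2), wiping out term(j=1)

One of the contributions is silently lost, and which one is lost depends on scheduling, which is why your result changes from run to run.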
The canonical method to sort this out would be to employ a reduction technique. However, for simplicity, I will suggest that atomics could help you sort it out. This is easier to implement based on what you have shown, and will help confirm the race condition. After that, if you want to try a reduction method, there are plenty of tutorials for that (one is linked above) and plenty of questions here on the cuda tag about it.
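For completeness, a minimal sketch of what a reduction-based version could look like (my own variant, with a hypothetical kernel name, one block per output index i, and a power-of-two block size; it is certainly not the only way to structure it):

    #define BLOCK 256   // must be a power of two for the tree reduction below

    __global__ void kernel_res_reduce(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
        __shared__ TYPE sdata[BLOCK];
        int i = blockIdx.x;                 // each block owns one output element
        TYPE my_sum = 0;

        if( i > 0 && i < resultSize - 1){
            TYPE start = (i >= size) ? (i % size ) + 1 : 0;
            TYPE end = (i + 1) >> 1;
            // threads stride across the inner loop and accumulate private partial sums
            for(TYPE j = start + threadIdx.x; j < end; j += BLOCK)
                my_sum += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
        }

        sdata[threadIdx.x] = my_sum;
        __syncthreads();

        // standard shared-memory tree reduction
        for(int s = BLOCK/2; s > 0; s >>= 1){
            if(threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }

        // exactly one thread writes each result[i], so there is no race
        if(threadIdx.x == 0 && i > 0 && i < resultSize - 1)
            result[i] += sdata[0];
    }

    // launched with one block per output element:
    // kernel_res_reduce <<<resultSize, BLOCK>>> (devA, devB, devD, devResult, size, resultSize);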
You could modify your kernel to something like this, to sort out the race condition:
    __global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        int j = blockDim.y * blockIdx.y + threadIdx.y;
        if( i > 0 && i < resultSize - 1){
            TYPE start = (i >= size) ? (i % size ) + 1 : 0;
            TYPE end = (i + 1) >> 1;
            if(j >= start && j < end ){ // see note below
                atomicAdd(result+i, (( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] )));
            }
        }
    }
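One usage detail to double-check on your side: because the kernel now only ever adds into result, the array must hold its intended starting values before the launch. Assuming you want it to start at zero, something like this before the kernel call would do it:

    cudaMemset(devResult, 0, resultSize * sizeof(TYPE));   // assuming result should start at zero
    kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);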
Note that depending on your GPU type, and the actual type of TYPE you are using, this may not work (may not compile) as-is. But since you had previously used TYPE as a loop variable, I am assuming it is an integer type, and the necessary atomicAdd for those should be available.
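If it turned out that TYPE were double and you were compiling for a GPU older than compute capability 6.0, there is no built-in atomicAdd for double; the CUDA programming guide shows a CAS-based replacement along these lines:

    #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
    __device__ double atomicAdd(double* address, double val){
        unsigned long long int* address_as_ull = (unsigned long long int*)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // swap in (assumed + val) only if nobody else changed the value in between
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);
    }
    #endif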
A few other comments:
This may not be giving you the grid size you expect. Because of operator precedence and integer division, 1/32 and 7/8 both evaluate to 0, so this is effectively dim3 grid( resultSize, resultSize ):

    dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
I think the usual calculations there would be:

    dim3 grid( (resultSize+31)/32, (resultSize+7)/8 );
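If you prefer not to repeat that pattern, a tiny round-up division helper (div_up is just a name I made up) keeps the intent clear:

    static inline unsigned int div_up(unsigned int a, unsigned int b){ return (a + b - 1) / b; }

    dim3 grid( div_up(resultSize, 32), div_up(resultSize, 8) );   // same values as the corrected line above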
I always recommend proper CUDA error checking and running your codes with cuda-memcheck, any time you are having trouble with a CUDA code, to make sure there are no runtime errors.
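As a sketch of what that could look like (the CUDA_CHECK macro name is my own; any equivalent wrapper works), checking both the launch and the subsequent synchronization catches most problems:

    #include <cstdio>
    #include <cstdlib>

    #define CUDA_CHECK(call) do {                                          \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess){                                           \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

    kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
    CUDA_CHECK(cudaGetLastError());        // reports launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised during kernel execution

Then cuda-memcheck ./your_app (the executable name is a placeholder) will additionally flag out-of-bounds or misaligned accesses that the error codes alone do not show.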
It also looks to me like this:

    if(j >= start && j <= end ){

should be this:

    if(j >= start && j < end ){

to match your for-loop range. I am also making an assumption that size is less than resultSize (again, a MCVE would help).