I want to implement Gaussian elimination in CUDA. But I have problem with thread synchronization inside if/else.
Here is my simple code:
__device__ bool zr(float val) {
const float zeroEpsilon = 1e-12f;
return fabs(val) < zeroEpsilon;
}
__global__ void gauss(float* data, unsigned int size, bool* success) {
//unsigned int len = size * (size + 1);
extern __shared__ float matrix[];
__shared__ bool succ;
__shared__ float div;
unsigned int ridx = threadIdx.y;
unsigned int cidx = threadIdx.x;
unsigned int idx = (size + 1) * ridx + cidx;
matrix[idx] = data[idx];
if (idx == 0)
succ = true;
__syncthreads();
for (unsigned int row = 0; row < size; ++row) {
if (ridx == row) {
if (cidx == row) {
div = matrix[idx];
if (zr(div)) {
succ = false;
div = 1.0;
}
}
__syncthreads();
matrix[idx] = matrix[idx] / div;
__syncthreads();
}
else {
__syncthreads();
__syncthreads();
}
if (!succ)
break;
}
__syncthreads();
if (idx == 0)
*success = succ;
data[idx] = matrix[idx];
__syncthreads();
}
It works this way:
The problem is inside if/else block inside for loop - deadlock:
==Ocelot== PTX Emulator failed to run kernel "_Z5gaussPfjPb" with exception:
==Ocelot== [PC 30] [thread 0] [cta 0] bar.sync 0 - barrier deadlock:
==Ocelot== context at: [PC: 59] gauss.cu:57:1 11111111111111111111
==Ocelot== context at: [PC: 50] gauss.cu:54:1 11111111111111111111
==Ocelot== context at: [PC: 33] gauss.cu:40:1 00000000000000011111
==Ocelot== context at: [PC: 30] gauss.cu:51:1 11111111111111100000
I hava no idea why is that. When I remove synchronization from if/else block it's works. Can someone explain me that?
__syncthreads()
is doing this.
When a thread
reaches __syncthreads
as an instruction it will block/stall, when that happens the warp
(32 threads) will also block, it will block until all threads
in the same block of threads
have reached that statement.
However, if one warp or one thread in the same block of threads
does not reach the same __syncthreads
statement it will deadlock as at least one thread
is waiting for all the other threads
to reach the same statement, if if does not happen you will get a deadlock.
What you are now doing is excluding at least one threads
from participating in the __syncthreads
event, by placing a __syncthreads
inside a if statement not all threads will reach. Thus, deadlock.