Search code examples
cudaparallel-processinggaussianthread-synchronization

Thread synchronization inside if/else block in CUDA


I want to implement Gaussian elimination in CUDA. But I have problem with thread synchronization inside if/else.

Here is my simple code:

__device__ bool zr(float val) {
    const float zeroEpsilon = 1e-12f;
    return fabs(val) < zeroEpsilon;
}

__global__ void gauss(float* data, unsigned int size, bool* success) {
    //unsigned int len = size * (size + 1);
    extern  __shared__ float matrix[];
    __shared__ bool succ;
    __shared__ float div;
    unsigned int ridx = threadIdx.y;
    unsigned int cidx = threadIdx.x;
    unsigned int idx = (size + 1) * ridx  + cidx;
    matrix[idx] = data[idx];
    if (idx == 0)
        succ = true;
    __syncthreads();
    for (unsigned int row = 0; row < size; ++row) {
        if (ridx == row) {
            if (cidx == row) {
                div = matrix[idx];
                if (zr(div)) {
                    succ = false;
                    div = 1.0;
                }
            }
            __syncthreads();
            matrix[idx] = matrix[idx] / div;
            __syncthreads();
        }
        else {
            __syncthreads();
            __syncthreads();
        }
        if (!succ)
            break;
    }
    __syncthreads();
    if (idx == 0)
        *success = succ;
    data[idx] = matrix[idx];
    __syncthreads();
}

It works this way:

  1. Copy matrix into shared memory.
  2. Iterate over rows.
  3. Divide row by value on it's diagonal.

The problem is inside if/else block inside for loop - deadlock:

==Ocelot== PTX Emulator failed to run kernel "_Z5gaussPfjPb" with exception: 
==Ocelot== [PC 30] [thread 0] [cta 0] bar.sync 0 - barrier deadlock:
==Ocelot== context at: [PC: 59] gauss.cu:57:1 11111111111111111111
==Ocelot== context at: [PC: 50] gauss.cu:54:1 11111111111111111111
==Ocelot== context at: [PC: 33] gauss.cu:40:1 00000000000000011111
==Ocelot== context at: [PC: 30] gauss.cu:51:1 11111111111111100000

I hava no idea why is that. When I remove synchronization from if/else block it's works. Can someone explain me that?


Solution

  • __syncthreads() is doing this.

    When a thread reaches __syncthreads as an instruction it will block/stall, when that happens the warp (32 threads) will also block, it will block until all threads in the same block of threads have reached that statement.

    However, if one warp or one thread in the same block of threads does not reach the same __syncthreads statement it will deadlock as at least one thread is waiting for all the other threads to reach the same statement, if if does not happen you will get a deadlock.

    What you are now doing is excluding at least one threads from participating in the __syncthreads event, by placing a __syncthreads inside a if statement not all threads will reach. Thus, deadlock.