visual-c++cuda thread-synchronization release-mode

Incorrect synchronization inside a "while" loop (occuring only in Release mode)

I have a kernel with a "while" loop, which iteratively updates elements of an array using information about neighbors (only one neighbor in the sample code below). This loop stops when no element is changed at the current iteration.

Unfortunately, in some situations part of threads go out of this loop prematurely (like if they ignore synchronization barrier). Some inputs are processed correctly every time, and other inputs (many of them) are processed incorrectly every time (i.e. there are no stochastic factors). Strangely, this error occurs only in Release version while Debug version always worked fine. More precisely, the CUDA compiler option "-G (Generate GPU Debug Information)" determines whether the processing is correct. Arrays of size 32x32 or smaller are always processed correctly.

Here is a sample code:

__global__ void kernel(int *source, int size, unsigned char *result, unsigned char *alpha)
{
    int x = threadIdx.x, y0 = threadIdx.y * 4;
    int i, y;
    __shared__ bool alpha_changed;

    // Zero intermediate array using margins for safe access to neighbors
    const int stride = MAX_SIZE + 2;
    for (i = threadIdx.x + threadIdx.y * blockDim.x; i < stride * (stride + 3); i += blockDim.x * blockDim.y)
    {
        alpha[i] = 0;
    }
    __syncthreads();

    for (int bit = MAX_BITS - 1; bit >= 0; bit--)
    {
        __syncthreads();

        // Fill intermediate array with bit values from input array
        alpha_changed = true;
        alpha[(x + 1) + (y0 + 1) * stride] = (source[x + (y0 + 0) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 2) * stride] = (source[x + (y0 + 1) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 3) * stride] = (source[x + (y0 + 2) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 4) * stride] = (source[x + (y0 + 3) * size] & (1 << bit)) != 0;
        __syncthreads();

        // The loop in question
        while (alpha_changed)
        {
            alpha_changed = false;
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 1) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 2) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 3) * stride] != 0 && alpha[(x + 1) + (y0 + 3) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 3) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 4) * stride] != 0 && alpha[(x + 1) + (y0 + 4) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 4) * stride] = 1;
            }
            __syncthreads();
        }
        __syncthreads();

        // Save result
        result[x + (y0 + 0) * size + bit * size * size] = alpha[(x + 1) + (y0 + 1) * stride];
        result[x + (y0 + 1) * size + bit * size * size] = alpha[(x + 1) + (y0 + 2) * stride];
        result[x + (y0 + 2) * size + bit * size * size] = alpha[(x + 1) + (y0 + 3) * stride];
        result[x + (y0 + 3) * size + bit * size * size] = alpha[(x + 1) + (y0 + 4) * stride];
        __syncthreads();
    }
}

// Run only 1 thread block, where size equals 64.
kernel <<< 1, dim3(size, size / 4) >>> (source_gpu, size, result_gpu, alpha_gpu);

The expected result of this sample kernel is array, where each line can contain only contiguous intervals of "1" values. But instead of this, I get some lines, where "0" and "1" are somehow alternated.

This error is reproduced on my mobile GPU GeForce 740M (Kepler), on Windows 7 x64 SP1, on either CUDA 6.0 or 6.5, using either Visual C++ 2012 or 2013. I can also provide a sample Visual Studio project with the sample input array (i.e. which is processed incorrectly).

I have already tried different configurations of syncthreads(), fences and "volatile" qualifier, but this error remained.

Any help is appreciated.

Solution

I think the problem is your access to alpha_changed. Keep in mind this is only one value for all the threads in a block. There is a race condition between one warp resetting this variable, and another warp checking the loop condition:

    // The loop in question
    while (alpha_changed)
    {
        alpha_changed = false;
        // ...
        // alpha_changed may be set to true here
        // ...

        __syncthreads();

        // race condition window here. Another warp may already execute
        // the alpha_changed = false; line before this warp continues.
    }

The key thing is doing a __syncthreads() before setting the shared variable to false.

You can use a local variable inside the loop to figure out if that thread made any change. This avoids having to use __syncthreads() all over the place. Then do a reduction in the end of the loop:

    // The loop in question
    while (alpha_changed)
    {
        bool alpha_changed_here = false;
        // ...
        // alpha_changed_here may be set to true here
        // ...

        __syncthreads();
        alpha_changed = false;
        __syncthreads();
        // I think you can get away with a simple if-statement here
        // instead of a proper reduction
        if (alpha_changed_here) alpha_changed = true;
        __syncthreads();
    }

As far as I know, this method of using just one variable in shared memory currently works. If you want to be sure, use a proper reduction algorithm. You can use __any() to do a reduction for 32 values in one instruction by one warp. The algorithm to use depends on the size of your blocks (I don't know the exact behavior is the size is not a multiple of 32).