I wrote a simple CUDA kernel as follows:
__global__ void cudaDoSomethingInSharedMemory(float* globalArray, size_t pitch)
{
    __shared__ float sharedInputArray[1088];
    __shared__ float sharedOutputArray[1088];
    int tid = threadIdx.x;                  // Use 1D block
    int rowIdx = blockIdx.x;                // Use 1D grid
    int rowOffset = pitch / sizeof(float);  // Offset in elements (not in bytes)

    // Copy data from global memory to shared memory (checked)
    while (tid < 1088) {
        sharedInputArray[tid] = globalArray[rowIdx * rowOffset + tid];
        tid += blockDim.x;
        __syncthreads();
    }
    __syncthreads();

    // Do something (already simplified and the problem still exists)
    tid = threadIdx.x;
    while (tid < 1088) {
        if (tid % 2 == 1) {
            if (tid == 1087) {
                sharedOutputArray[tid / 2 + 544] = 321;
            } else {
                sharedOutputArray[tid / 2 + 544] = 321;
            }
        }
        tid += blockDim.x;
        __syncthreads();
    }
    tid = threadIdx.x;
    while (tid < 1088) {
        if (tid % 2 == 0) {
            if (tid == 0) {
                sharedOutputArray[tid / 2] = 123;
            } else {
                sharedOutputArray[tid / 2] = 123;
            }
        }
        tid += blockDim.x;
        __syncthreads();
    }
    __syncthreads();

    // Copy data from shared memory back to global memory (and add a read-back for testing)
    float temp = -456;
    tid = threadIdx.x;
    while (tid < 1088) {
        globalArray[rowIdx * rowOffset + tid] = sharedOutputArray[tid];
        temp = globalArray[rowIdx * rowOffset + tid]; // (1*) Errors are found here.
        __syncthreads();
        tid += blockDim.x;
    }
    __syncthreads();
}
The code changes sharedOutputArray from "interlaced" to "clustered":

123 321 123 321 ... 123 321

is changed to

123 123 123 ... 123 321 321 321 ... 321

and writes the clustered result to the global memory array globalArray, which is allocated using cudaMallocPitch().
This kernel processes a 2D array. The idea is simple: one block per row (so the grid is 1D and the number of blocks equals the number of rows) and N threads per block. The array has 1920 rows and 1088 columns, so there are 1920 blocks.
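For context, here is a minimal host-side sketch of this setup (the variable names, launch configuration and missing error checking are my own illustrative assumptions; only the cudaMallocPitch() allocation and the one-block-per-row launch come from the description above):

const int rows = 1920, cols = 1088;
const int N = 512;                  // threads per block; the failing case
float* globalArray = NULL;
size_t pitch = 0;                   // row stride in bytes, set by cudaMallocPitch

// cudaMallocPitch pads each row so that every row starts at an aligned address.
cudaMallocPitch((void**)&globalArray, &pitch, cols * sizeof(float), rows);

// One 1D block per row, N threads per block.
cudaDoSomethingInSharedMemory<<<rows, N>>>(globalArray, pitch);
cudaDeviceSynchronize();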
The problem is: when N (the number of threads per block) is 64, 128 or 256, everything works (or at least appears to work) fine. However, when N is 512 (I am using a GTX 570 with compute capability 2.0, where the maximum size of each block dimension is 1024), errors appear.
The errors are: the elements (each a 4-byte floating point number) of a row in global memory from position 256 to 287 (index starts at 0; the error strip is 32 elements, i.e. 128 bytes, long) are 0 rather than 123. It looks like
123 123 123 ... 0 0 0 0 0... 0 123 123 ...
I checked the line marked (1*) above: those elements were 123 in sharedOutputArray, yet when such an element (for example tid == 270) was read back at (1*), temp showed 0. I also checked tid == 255 and tid == 288, and there the element was 123 (correct). This type of error happened in almost all 1920 rows.
I tried to "synchronize" (maybe already over-synchronized) the threads but it did not work. What makes me confused is why 64, 128 or 256 threads worked fine but 512 did not work. I know using 512 threads may not be optimized for the performance and I just would like to know where I made the mistake.
Thank you in advance.
You are using __syncthreads() inside conditional code where the condition does not evaluate uniformly across the threads of a block. Don't do that: if the threads of a block reach a __syncthreads() a different number of times, the behavior is undefined. With 512 threads per block, for example, threads 0-63 iterate your while loops three times (tid = threadIdx.x, threadIdx.x + 512, threadIdx.x + 1024) while threads 64-511 iterate only twice, so they do not all arrive at the barrier together. Note that 1088 is a multiple of 64 but not of 128, 256 or 512, so the barrier is divergent for those sizes too; the undefined behavior just happened not to manifest there.

In your case you can simply remove the __syncthreads() inside the while loops, as they serve no purpose: within any single loop, no thread reads shared memory written by another thread, and the existing __syncthreads() calls between the loops already provide all the ordering you need.
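For illustration, here is a minimal sketch of the restructured kernel (only the load and write-back loops are shown in full; the interlaced-to-clustered loops change in the same way):

__global__ void cudaDoSomethingInSharedMemory(float* globalArray, size_t pitch)
{
    __shared__ float sharedInputArray[1088];
    __shared__ float sharedOutputArray[1088];
    int rowIdx = blockIdx.x;
    int rowOffset = pitch / sizeof(float);

    // Load: no barrier inside the loop, because each thread writes
    // only its own elements of sharedInputArray.
    for (int tid = threadIdx.x; tid < 1088; tid += blockDim.x)
        sharedInputArray[tid] = globalArray[rowIdx * rowOffset + tid];
    __syncthreads(); // reached exactly once by every thread of the block

    // ... the interlaced-to-clustered writes go here, restructured the
    // same way: plain loops with a single __syncthreads() afterwards ...
    __syncthreads();

    // Write back: again, no barrier needed inside the loop.
    for (int tid = threadIdx.x; tid < 1088; tid += blockDim.x)
        globalArray[rowIdx * rowOffset + tid] = sharedOutputArray[tid];
}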