I have a CUDA kernel that appears to have a race condition, and I am trying to pinpoint where it comes from. I am aware of the racecheck tool of cuda-memcheck; however, racecheck reports no hazards for small inputs, which is actually consistent with my own investigation. For large inputs, though, racecheck seems to take forever (literally), so I can't use it.
Briefly: a 1D vector d_mat_3d, defined as a __device__ variable, is filled with 0 and resides in global memory. Two large arrays that are the inputs of the kernel (d_A and d_v) are also defined in main and passed to the kernel. A segment of d_mat_3d, called mat_2d, is cut out, loaded into shared memory, and processed; then mat_2d is written back to d_mat_3d in global memory.
As shown below, atomic operations are used, because without them mat_2d would encounter a race condition between different threads.
The reason I suspect I still have some sort of race condition is that the results in d_mat_3d are different every time.
Any thought as to where this race condition may come from? Are there any steps I can take to track it down (other than the racecheck tool)? If you think there is no evidence of a race condition, can you explain why different values are assigned to d_mat_3d every time I execute the kernel?
CUDA 9.0 / NVidia Titan Black / Ubuntu 16.04
#include <cstdlib>
#include <sstream>
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime_api.h>

#define W 7          // fix limit for loops in kernel
#define SIZE 100     // defining matrix dimension
#define N_ELEM 10000 // no of elements in each vector
#define NTPB 1024    // no of threads per block

using namespace std;

__device__ float d_mat_3d[SIZE*SIZE*SIZE];

__global__ void cuda_kernel(float *d_A, float *d_v){
    __shared__ float mat_2d[SIZE*SIZE]; // a 2D slice of 3D matrix d_mat_3d

    unsigned int n = blockDim.x*blockIdx.x+threadIdx.x;
    if(n >= N_ELEM)
        return;

    int x, y, z, i;
    float r;
    float A = d_A[n];
    float v = d_v[n];

    #pragma unroll
    for(x=0; x<SIZE; x++){
        // load mat_2d (on shared memory) using d_mat_3d (on global memory)
        for(i=0; i<SIZE*SIZE; i++){
            mat_2d[i] = d_mat_3d[i+x*SIZE*SIZE];
        }
        // sync threads as mat_2d is on shared memory
        __syncthreads();

        for(y=SIZE/2; y<SIZE/2+W; y++){
            for(z=SIZE/2; z<SIZE/2+W; z++){
                r = sqrt( pow(A,2) / v ); // no need to be in these loops. I know, but for my real case, it must be.
                atomicAdd(&mat_2d[z+y*SIZE], r); // atomically add r
            }
        }
        __syncthreads();

        // write mat_2d (shared memory) back to mat_3d (global memory)
        for(i=0; i<SIZE*SIZE; i++){
            d_mat_3d[i+x*SIZE*SIZE] = mat_2d[i];
        }
    }
}

// this function writes h_mat_3d to disk.
void write_image(float *h_mat_3d){
    ostringstream o_addToFile;
    o_addToFile << "mat3d.bin";
    FILE *pFile;
    pFile = fopen(o_addToFile.str().c_str(), "wb");
    for(int i=0; i<SIZE*SIZE*SIZE; i++){
        fwrite(&h_mat_3d[i], sizeof(float), 1, pFile);
    }
    fclose(pFile);
}

int main(){
    int i;
    float *h_A = new float[N_ELEM]; // some large vector
    float *h_v = new float[N_ELEM]; // some other large vector
    float h_mat_3d[SIZE*SIZE*SIZE]; // will be filled w/ 0
    float *d_A; // device variables
    float *d_v;

    for(i=0; i<N_ELEM; i++){
        h_A[i] = 0.2f+(float)i/N_ELEM; // fill out with some calculations
        h_v[i] = 0.5f+2.f*i/N_ELEM;
    }
    for(i=0; i<SIZE*SIZE*SIZE; i++){
        h_mat_3d[i] = 0.f; // fill h_mat_3d with 0
    }

    cudaMalloc((void **)&d_A, sizeof(float)*N_ELEM); // allocate variables on device
    cudaMalloc((void **)&d_v, sizeof(float)*N_ELEM);
    cudaMemcpy(d_A, h_A, sizeof(float)*N_ELEM, cudaMemcpyHostToDevice); // copy from host to device
    cudaMemcpy(d_v, h_v, sizeof(float)*N_ELEM, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_mat_3d, &h_mat_3d, sizeof(float)*SIZE*SIZE*SIZE); // copy h_mat_3d to device

    cuda_kernel<<<(N_ELEM+NTPB-1)/NTPB,NTPB>>>(d_A, d_v); // execute kernel

    cudaMemcpyFromSymbol(h_mat_3d, d_mat_3d, sizeof(float)*SIZE*SIZE*SIZE); // write it back to h_mat_3d
    write_image(h_mat_3d); // write h_mat_3d to disk for checking

    cudaFree(d_A); // free memory
    cudaFree(d_v);
    delete [] h_A;
    delete [] h_v;
    return 0;
}
Yes, you have at least 2 different race conditions in your code.
1. Since you are loading the entirety of shared memory in a loop (i.e. loading all of it, over and over again, in a loop), it's necessary to protect both the start and the end of the load operation with __syncthreads(). Doing so will reduce the run-to-run variability down to the 6th or 7th significant decimal digit, which is consistent with ordinary float variability when the order of floating-point operations is not duplicated from run to run (which will generally be the case here). Adding the following line:
for(x=0; x<SIZE; x++){
    __syncthreads(); // add this line
    // load mat_2d (on shared memory) using d_mat_3d (on global memory)
    for(i=0; i<SIZE*SIZE; i++){
        mat_2d[i] = d_mat_3d[i+x*SIZE*SIZE];
    }
    // sync threads as mat_2d is on shared memory
    __syncthreads();
should mostly correct the issue. Without it, as your kernel loops in x, some warps can "race ahead" and start loading shared memory for the next iteration while previous warps are still busy with the previous loop iteration in x (and take note of item 2 under "a few other things" below, which probably exacerbates this issue).
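To get a concrete feel for that residual variability, here is a small standalone host-side sketch (illustrative only, not part of either listing): summing the same float values in two different orders usually perturbs only the last significant digits.

#include <cstdio>

int main(){
    const int n = 10000;
    float up = 0.f, down = 0.f;
    // same set of values, accumulated in opposite orders
    for(int i=0; i<n; i++)    up   += 0.2f + (float)i/n; // ascending
    for(int i=n-1; i>=0; i--) down += 0.2f + (float)i/n; // descending
    // the two sums usually differ in roughly the 7th significant decimal digit
    printf("ascending : %f\ndescending: %f\n", up, down);
    return 0;
}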
2. Since each threadblock is writing to the entirety of d_mat_3d, you have a race condition as each threadblock attempts to write various values. The order of threadblock execution (undefined by CUDA) will largely determine what ends up there, and this can easily vary run-to-run. The only trivial way I know of to sort this out, short of a complete kernel rewrite, is simply to launch 1 threadblock (which will still populate the same regions of d_mat_3d). This sort of race condition is a global-memory race, and cuda-memcheck currently cannot discover this kind of race. I hesitate to read too much into it, but this code doesn't really make much sense, and it indicates either a lack of concern for writing a sensible code or a lack of understanding of the CUDA execution model (especially coupled with item 2 below).
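As a tooling side note (illustrative invocation only, not output from this code): the racecheck tool reports shared-memory hazards, so it can flag the hazard in item 1 but not this global-memory race:

$ cuda-memcheck --tool racecheck ./t268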
There are a few other things I would point out.
1. Your use of __syncthreads() is potentially illegal in the last threadblock. This construct:

if(n >= N_ELEM)
    return;

will allow some threads in the (last) threadblock to retire early, meaning they will not participate in the subsequent __syncthreads() statements. That is illegal in CUDA, and the restriction is covered in the programming guide. It is fixable by removing the early return and instead protecting the various segments of your kernel loop (except the __syncthreads() statements) with if (n < N_ELEM) or similar.
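In case it helps, here is a minimal, hypothetical kernel (not the test case below; guarded_kernel, out, and n_elem are made-up names) showing the pattern: the bounds check wraps the work, while every thread in the block, in-range or not, still reaches the barrier.

__global__ void guarded_kernel(float *out, int n_elem){
    __shared__ float buf[NTPB];
    unsigned int n = blockDim.x*blockIdx.x+threadIdx.x;
    // no early return: out-of-range threads stay alive so the whole block can synchronize
    if (n < n_elem)              // only in-range threads do useful work...
        buf[threadIdx.x] = 2.f*n;
    __syncthreads();             // ...but every thread in the block reaches the barrier
    if (n < n_elem)
        out[n] = buf[threadIdx.x];
}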
2. Your kernel code is generally strange, as you have already acknowledged in your comments. One example: every thread in the block performs the exact same loads and stores to/from shared memory. This is wasteful, performance-wise, in several ways; a common alternative is sketched below.
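For reference, the usual idiom (a sketch only, reusing the question's identifiers; the test case below does not make this change) is to stride the copy across the block so each element of the slice is handled by exactly one thread:

// cooperative copy: thread k handles elements k, k+blockDim.x, k+2*blockDim.x, ...
for(i=threadIdx.x; i<SIZE*SIZE; i+=blockDim.x){
    mat_2d[i] = d_mat_3d[i+x*SIZE*SIZE];
}
__syncthreads(); // ensure the whole slice is loaded before any thread uses it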
I'm not suggesting this covers every issue with the code, merely the things I noticed. Here is a relatively complete test case that I used to verify my findings. It includes some changes to address items I mentioned above, as well as various other changes that seemed important to me:
$ cat t268.cu
#include <cstdlib>
#include <sstream>
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime_api.h>

#define W 7          // fix limit for loops in kernel
#define SIZE 100     // defining matrix dimension
#define N_ELEM 10000 // no of elements in each vector
#define NTPB 1024    // no of threads per block

using namespace std;

__device__ float d_mat_3d[SIZE*SIZE*SIZE];

__global__ void cuda_kernel(float *d_A, float *d_v){
    __shared__ float mat_2d[SIZE*SIZE]; // a 2D slice of 3D matrix d_mat_3d

    unsigned int n = blockDim.x*blockIdx.x+threadIdx.x;
    int x, y, z, i;
    float r;
    float A = d_A[n];
    float v = d_v[n];

    #pragma unroll
    for(x=0; x<SIZE; x++){
        __syncthreads();
        if (n < N_ELEM){
            // load mat_2d (on shared memory) using d_mat_3d (on global memory)
            for(i=0; i<SIZE*SIZE; i++){
                mat_2d[i] = d_mat_3d[i+x*SIZE*SIZE];
            }
        }
        // sync threads as mat_2d is on shared memory
        __syncthreads();
        if (n < N_ELEM){
            for(y=SIZE/2; y<SIZE/2+W; y++){
                for(z=SIZE/2; z<SIZE/2+W; z++){
                    r = sqrt( pow(A,2) / v ); // no need to be in these loops. I know, but for my real case, it must be.
                    atomicAdd(&(mat_2d[z+y*SIZE]), r); // atomically add r
                }
            }
        }
        __syncthreads();
        // write mat_2d (shared memory) back to mat_3d (global memory)
        if (n < N_ELEM){
            for(i=0; i<SIZE*SIZE; i++){
                d_mat_3d[i+x*SIZE*SIZE] = mat_2d[i];
            }
        }
    }
}
// this version of write_image prints selected values of h_mat_3d to stdout for checking.
void write_image(float *h_mat_3d){
    for (int i = 0; i < SIZE*SIZE; i++){
        for (int j = 0; j < SIZE; j++)
            if (h_mat_3d[i*SIZE+j] > 1.0f) printf("%d:%f\n ", i*SIZE+j, h_mat_3d[i*SIZE+j]);
        printf("\n");
    }
}
int main(){
    int i;
    float *h_A = new float[N_ELEM]; // some large vector
    float *h_v = new float[N_ELEM]; // some other large vector
    float *h_mat_3d = new float[SIZE*SIZE*SIZE]; // will be filled w/ 0
    float *d_A; // device variables
    float *d_v;

    for(i=0; i<N_ELEM; i++){
        h_A[i] = 0.2f+i/(float)N_ELEM; // fill out with some calculations
        h_v[i] = 0.5f+2.f*i/(float)N_ELEM;
    }
    for(i=0; i<SIZE*SIZE*SIZE; i++){
        h_mat_3d[i] = 0.f; // fill h_mat_3d with 0
    }

    cudaMalloc((void **)&d_A, sizeof(float)*N_ELEM); // allocate variables on device
    cudaMalloc((void **)&d_v, sizeof(float)*N_ELEM);
    cudaMemcpy(d_A, h_A, sizeof(float)*N_ELEM, cudaMemcpyHostToDevice); // copy from host to device
    cudaMemcpy(d_v, h_v, sizeof(float)*N_ELEM, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_mat_3d, h_mat_3d, sizeof(float)*SIZE*SIZE*SIZE); // copy h_mat_3d to device

    cuda_kernel<<<1,NTPB>>>(d_A, d_v); // execute kernel

    cudaMemcpyFromSymbol(h_mat_3d, d_mat_3d, sizeof(float)*SIZE*SIZE*SIZE); // write it back to h_mat_3d
    write_image(h_mat_3d); // write h_mat_3d to disk for checking

    cudaFree(d_A); // free memory
    delete [] h_A;
    delete [] h_v;
    return 0;
}
$ nvcc -arch=sm_52 -o t268 t268.cu
$ ./t268 > out1.txt
$ ./t268 > out2.txt
$ diff out1.txt out2.txt |more
51,57c51,57
< 5050:330.657715
< 5051:330.657715
< 5052:330.657715
< 5053:330.657715
< 5054:330.657715
< 5055:330.657715
< 5056:330.657715
---
> 5050:330.657654
> 5051:330.657593
> 5052:330.657593
> 5053:330.657593
> 5054:330.657593
> 5055:330.657593
> 5056:330.657593
59,65c59,65
< 5150:330.657715
< 5151:330.657715
< 5152:330.657715
< 5153:330.657715
< 5154:330.657745
< 5155:330.657745
< 5156:330.657745
---
> 5150:330.657593
> 5151:330.657593
> 5152:330.657593
> 5153:330.657593
> 5154:330.657593
> 5155:330.657593
> 5156:330.657593
67,73c67,73
< 5250:330.657745
< 5251:330.657745
< 5252:330.657745
< 5253:330.657745
< 5254:330.657715
< 5255:330.657715
< 5256:330.657715
---
> 5250:330.657593
> 5251:330.657593
> 5252:330.657623
> 5253:330.657593
> 5254:330.657593
> 5255:330.657593
> 5256:330.657593
75,81c75,81
< 5350:330.657715
< 5351:330.657715
< 5352:330.657715
< 5353:330.657715
< 5354:330.657715
< 5355:330.657745
< 5356:330.657715
---
> 5350:330.657593
> 5351:330.657593
$
As can be seen, the remaining variation is in the 7th significant decimal digit:
51,57c51,57
< 5050:330.657715
...
---
> 5050:330.657654