Tags: c++, cuda, race-condition

Why doesn't CUDA synchronization point prevent race condition?


We ran cuda-memcheck --tool racecheck <executable> on our code and got the following memory hazard errors.

========= Race reported between Read access at 0x00004098 CUDA.cu:123:KernelFunction()
=========     and Write access at 0x00005058 in CUDA.cu:146:KernelFunction() [529996 hazards]  

Here's the code. The tool claims that line 123, value = sharedMemory0[sharedMemoryIndex];, is in a race condition with line 146, sharedMemory0[sharedIndex0] = sharedMemory1[sharedIndex1];. We have

// Synchronization Point 1 
__syncthreads(); 
__threadfence_block();

between the two lines. Shouldn't all the threads synchronize at that point, and shouldn't all previous memory reads and writes have completed by then? All threads and memory accesses should finish the first j-loop before any thread starts the second j-loop. So, in our minds, Synchronization Point 1 should isolate the two j-loops and prevent a race condition, but the tool says that's not true.
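
For reference, here's the pattern we believe we're implementing, as a minimal hypothetical sketch (made-up kernel, names, and sizes, not our real code): every read of sharedMemory0 happens before the barrier, and every write to it happens after.

    #define TILE 256   // hypothetical block size; sketch assumes a single block of TILE threads

    __global__ void readThenWriteSketch(const float *in, float *out)
    {
       __shared__ float sharedMemory0[TILE];
       __shared__ float sharedMemory1[TILE];

       // Stage data into shared memory.
       sharedMemory0[threadIdx.x] = in[threadIdx.x];
       sharedMemory1[threadIdx.x] = in[threadIdx.x + TILE];
       __syncthreads();

       // Read phase: every thread reads some element of sharedMemory0.
       float value = sharedMemory0[(threadIdx.x + 1) % TILE];

       // Synchronization point: no thread should get past this barrier until
       // every thread's read above has finished. (As we understand it,
       // __syncthreads() already acts as a block-level memory fence, so the
       // __threadfence_block() is belt-and-braces.)
       __syncthreads();
       __threadfence_block();

       // Write phase: only now is sharedMemory0 overwritten.
       sharedMemory0[threadIdx.x] = sharedMemory1[threadIdx.x];
       __syncthreads();

       out[threadIdx.x] = value + sharedMemory0[threadIdx.x];
    }

If the real kernel reduces to this within each i-iteration, our expectation was that the barrier alone orders the read phase before the write phase.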

Why is the tool reporting a race condition? Any insights as to what we could do to prevent it?

We've also seen references to a tool that might be able to report a trace of the execution to more easily see the race condition. What tool and options can we use to get a trace to see more clearly why the race condition exists?

   for (i = 0; i < COUNT0; i++) {
       // Synchronization Point 0
       __syncthreads();
       __threadfence_block();
       for (j = 0; j < COUNT1; j++) {
          index = j*blockDim.x + threadIdx.x;
          if (index < THREAD_COUNT0) {
             for (k = 0; k < COUNT2; k++) {
                sharedMemoryIndex = function0(index);
                value = sharedMemory0[sharedMemoryIndex];
             }
          }         
       }

       // Synchronization Point 1
       __syncthreads();
       __threadfence_block();
       for (j = 0; j < COUNT2; j++) {
          index = j*blockDim.x + threadIdx.x;
          if (index < THREAD_COUNT1) {
            sharedIndex0 = function1(index);
            sharedIndex1 = function2(index);
            sharedMemory0[sharedIndex0] = sharedMemory1[sharedIndex1];
          }
       }
    }

We've also run the synccheck tool, cuda-memcheck --tool synccheck <executable>, and it reported the following error at Synchronization Point 1:

========= Barrier error detected. Encountered barrier with divergent threads in block
=========     at 0x00004ad8 in CUDA.cu:139:KernelFunction()
=========     by thread (0,0,0) in block (8,0,0)

There's probably a strong correlation between the two errors, but there isn't much documentation in the cuda-memcheck guide about what synchronization of divergent code is, why it's bad, or how to fix it; our best guess is sketched below.

Any comments?
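
Our best guess at what "synchronization of divergent code" means, written out as a minimal hypothetical sketch (made-up kernels, not ours): a __syncthreads() sitting inside a branch that only part of the block takes, so not every thread reaches the barrier.

    // Hypothetical example of a divergent barrier: if n is not a multiple of
    // blockDim.x, the threads of the last block diverge at the if, and only
    // some of them reach the __syncthreads(). The CUDA programming guide says
    // __syncthreads() in conditional code is only allowed when the condition
    // evaluates identically across the whole block, so this is undefined and
    // synccheck flags it.
    __global__ void divergentBarrier(float *data, int n)
    {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;
       if (idx < n) {
          data[idx] *= 2.0f;
          __syncthreads();          // not reached by threads with idx >= n
       }
    }

    // The usual fix is to hoist the barrier out of the divergent branch so
    // that every thread in the block executes it.
    __global__ void uniformBarrier(float *data, int n)
    {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;
       if (idx < n) {
          data[idx] *= 2.0f;
       }
       __syncthreads();             // every thread reaches this barrier
    }

Whether that is actually what's happening at CUDA.cu:139 in our kernel is exactly what we're unsure about.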

Solution

  • This code gives the same results independent of the number of threads used to perform the calculations. We ran the code with only a single thread and then ran it with multiple threads. By definition, a single-threaded version cannot run into a race condition, and yet it gave results identical to the multi-threaded version. cuda-memcheck --tool racecheck reports many, many race violations on the multi-threaded version. If race violations were actually occurring, the multi-threaded results would not exactly match the single-threaded results. Therefore, cuda-memcheck must be wrong and must have bugs when dealing with complex looping structures. cuda-memcheck was able to find race conditions in simple looping structures, just not in this complicated one.
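
What we mean by that comparison, as a rough hypothetical harness (the kernel body, signature, sizes, and launch configurations here are stand-ins, not our actual test code): run the same kernel with one thread and with many threads, then diff the outputs.

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Stand-in for the real kernel (the real one is far more involved); the
    // grid-stride loop lets the same code run with one thread or many.
    __global__ void KernelFunction(const float *in, float *out, int n)
    {
       for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
            i += gridDim.x * blockDim.x)
          out[i] = in[i] * 2.0f;
    }

    // Run the kernel single-threaded and multi-threaded and report any
    // element where the two outputs differ.
    void runAndCompare(const float *d_in, int n)
    {
       float *d_single, *d_multi;
       cudaMalloc(&d_single, n * sizeof(float));
       cudaMalloc(&d_multi,  n * sizeof(float));

       KernelFunction<<<1, 1>>>(d_in, d_single, n);                 // one thread total
       KernelFunction<<<(n + 255) / 256, 256>>>(d_in, d_multi, n);  // many threads
       cudaDeviceSynchronize();

       std::vector<float> a(n), b(n);
       cudaMemcpy(a.data(), d_single, n * sizeof(float), cudaMemcpyDeviceToHost);
       cudaMemcpy(b.data(), d_multi,  n * sizeof(float), cudaMemcpyDeviceToHost);

       for (int i = 0; i < n; i++)
          if (a[i] != b[i])
             printf("mismatch at %d: %f vs %f\n", i, a[i], b[i]);

       cudaFree(d_single);
       cudaFree(d_multi);
    }

runAndCompare would be called with a device buffer of n input floats; in our runs the equivalent of this check reported no mismatches.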