We run cuda-memcheck --tool racecheck <executable> on our code and get the following memory hazard errors:
========= Race reported between Read access at 0x00004098 in CUDA.cu:123:KernelFunction()
========= and Write access at 0x00005058 in CUDA.cu:146:KernelFunction() [529996 hazards]
Here's the code (listed below). The tool claims that line 123, value = sharedMemory0[sharedMemoryIndex];, is in a race condition with line 146, sharedMemory0[sharedIndex0] = sharedMemory1[sharedIndex1];. We have
// Synchronization Point 1
__syncthreads();
__threadfence_block();
between the two lines. Shouldn't all threads synchronize at that point, and shouldn't all previous memory reads and writes have completed by then? Every thread and every memory access should be done with the first j-loop before any thread starts the second j-loop. So in our minds, Synchronization Point 1 should isolate the two j-loops and prevent a race condition, but the tool says that's not true.
Why is the tool reporting a race condition? Any insights as to what we could do to prevent it?
We've also seen references to a tool that might be able to report a trace of the execution, which would make it easier to spot the race. What tool and options can we use to get such a trace and see more clearly why the race condition exists?
for (i = 0; i < COUNT0; i++) {
    // Synchronization Point 0
    __syncthreads();
    __threadfence_block();

    // first j-loop
    for (j = 0; j < COUNT1; j++) {
        index = j*blockDim.x + threadIdx.x;
        if (index < THREAD_COUNT0) {
            for (k = 0; k < COUNT2; k++) {
                sharedMemoryIndex = function0(index);
                value = sharedMemory0[sharedMemoryIndex];              // line 123
            }
        }
    }

    // Synchronization Point 1
    __syncthreads();
    __threadfence_block();

    // second j-loop
    for (j = 0; j < COUNT2; j++) {
        index = j*blockDim.x + threadIdx.x;
        if (index < THREAD_COUNT1) {
            sharedIndex0 = function1(index);
            sharedIndex1 = function2(index);
            sharedMemory0[sharedIndex0] = sharedMemory1[sharedIndex1]; // line 146
        }
    }
}
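To make our expectation concrete, here is a stripped-down sketch of the pattern we believe Synchronization Point 1 enforces (hypothetical names and sizes, not our real kernel): every thread finishes reading the shared array before any thread starts writing it.

#include <cstdio>

#define THREADS 128

__global__ void readThenWrite(int *out)
{
    __shared__ int shared0[THREADS];
    __shared__ int shared1[THREADS];
    int tid = threadIdx.x;

    shared0[tid] = tid;
    shared1[tid] = 2 * tid;
    __syncthreads();

    // Read phase: every thread reads shared0, possibly at another thread's index.
    int value = shared0[(tid + 1) % THREADS];

    // Barrier between the read phase and the write phase: no thread should start
    // the writes below until every thread has finished the reads above.
    __syncthreads();
    __threadfence_block();

    // Write phase: every thread writes shared0.
    shared0[tid] = shared1[tid];
    __syncthreads();

    out[tid] = value + shared0[tid];
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, THREADS * sizeof(int));
    readThenWrite<<<1, THREADS>>>(d_out);
    cudaDeviceSynchronize();

    int h_out[THREADS];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}

As far as we can tell, our kernel follows this same read / barrier / write pattern, just with the extra loops around it.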
We've also run the synccheck tool, cuda-memcheck --tool synccheck <executable>, and it reported the following error at Synchronization Point 1. There's probably a strong correlation between the two errors, but the cuda-memcheck guide doesn't have much documentation on what synchronization of divergent code is, why it's bad, and how to fix it. Any comments?
========= Barrier error detected. Encountered barrier with divergent threads in block
========= at 0x00004ad8 in CUDA.cu:139:KernelFunction()
========= by thread (0,0,0) in block (8,0,0)
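From the little documentation we have found, our understanding is that a "barrier with divergent threads" means a __syncthreads() that only part of the block reaches, along the lines of this hypothetical example (not our code):

__global__ void divergentBarrier(int *data)
{
    if (threadIdx.x < 16)          // only part of the block enters this branch...
    {
        data[threadIdx.x] *= 2;
        __syncthreads();           // ...so the barrier is executed by divergent threads (undefined behavior)
    }
}

If that is what synccheck is flagging, we don't immediately see where we do it, since both Synchronization Points sit outside the if (index < ...) checks in the code above.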
This code gives the same results regardless of the number of threads used to perform the calculations. We ran it with only a single thread and then with multiple threads. By definition, a single-threaded version cannot run into a race condition, and yet the single-threaded version gave results identical to the multi-threaded version. cuda-memcheck --tool racecheck reports many, many race violations on the multi-threaded version. If race violations were actually occurring, the multi-threaded results would not match the single-threaded results exactly. Therefore, cuda-memcheck must be wrong and must have bugs dealing with complex looping structures. It was able to find race conditions correctly in simple looping structures; it just can't seem to handle this complicated one.
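For reference, the comparison we describe is essentially the following kind of harness (a hypothetical stand-in kernel and test program, not our actual code): run the same kernel once with a single thread and once with many threads, then compare the outputs byte for byte.

#include <cstdio>
#include <cstring>

#define N 256

// Stand-in kernel: each thread strides over the data by blockDim.x, so the same
// outputs are produced no matter how many threads are launched.
__global__ void KernelStub(float *out, int count)
{
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        out[i] = 0.5f * i;
}

int main()
{
    float *d_out = nullptr;
    float single[N], multi[N];
    cudaMalloc(&d_out, N * sizeof(float));

    KernelStub<<<1, 1>>>(d_out, N);                 // single-threaded run
    cudaMemcpy(single, d_out, sizeof(single), cudaMemcpyDeviceToHost);

    KernelStub<<<1, 128>>>(d_out, N);               // multi-threaded run
    cudaMemcpy(multi, d_out, sizeof(multi), cudaMemcpyDeviceToHost);

    puts(memcmp(single, multi, sizeof(single)) == 0 ? "results match" : "results differ");

    cudaFree(d_out);
    return 0;
}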