I'm in the rather poor situation of not being able to use the CUDA debugger. I'm getting some strange results from usage of __syncthreads in an application with a single shared array (deltas). The following piece of code is performed in a loop:
__syncthreads(); //if I comment this out, things get funny
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this line doesnt seem to matter regardless if the first sync is commented out or not
//after sync: do something with the values of delta written in this threads and other threads of this block
Basically, I have code with overlapping blocks (required due to the nature of the algorithm). The program does compile and run but somehow I get systematically wrong values in the areas of vertical overlap. This is very confusing to me as I thought that the correct way to sync is to sync after the threads have performed my write to the shared memory.
This is the whole function:
//XC without repetitions
template <int blocksize, int order>
__global__ void __xc(unsigned short* raw_input_data, int num_frames, int width, int height,
float * raw_sofi_data, int block_size, int order_deprecated){
//we make a distinction between real pixels and virtual pixels
//real pixels are pixels that exist in the original data
//overlap correction: every new block has a margin of 3 threads doing less work (only computing deltas)
int x_corrected = global_x() - blockIdx.x * 3;
int y_corrected = global_y() - blockIdx.y * 3;
//if the thread is responsible for any real pixel
if (x_corrected < width && y_corrected < height){
// __shared__ float deltas[blocksize];
__shared__ float deltas[blocksize];
//the outer pixels of a block do not update SOFI values as they do not have sufficient information available
//they are used only to compute mean and delta
//also, pixels at the global edge have to be thrown away (as there is not sufficient data to interpolate)
bool within_inner_block =
threadIdx.x > 0
&& threadIdx.y > 0
&& threadIdx.x < blockDim.x - 2
&& threadIdx.y < blockDim.y - 2
//global edge
&& x_corrected > 0
&& y_corrected > 0
&& x_corrected < width - 1
&& y_corrected < height - 1
;
//init virtual pixels
float virtual_pixels[order * order];
if (within_inner_block){
for (int i = 0; i < order * order; ++i) {
virtual_pixels[i] = 0;
}
}
float mean = 0;
float intensity;
int lex_index_block = threadIdx.x + threadIdx.y * blockDim.x;
//main loop
for (int frame_idx = 0; frame_idx < num_frames; ++frame_idx) {
//shared memory read and computation of mean/delta
intensity = raw_input_data[lex_index_3D(x_corrected,y_corrected, frame_idx, width, height)];
__syncthreads(); //if I comment this out, things break
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this doesnt seem to matter
mean = deltas[lex_index_block]/(float)(frame_idx+1);
//if the thread is responsible for correlated pixels, i.e. not at the border of the original frame
if (within_inner_block){
//WORKING WITH DELTA STARTS HERE
virtual_pixels[0] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y + 1,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x,
threadIdx.y - 1,
blockDim.x)];
virtual_pixels[1] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x + 1,
threadIdx.y,
blockDim.x)];
virtual_pixels[2] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x,
threadIdx.y + 1,
blockDim.x)];
virtual_pixels[3] += deltas[lex_index_2D(
threadIdx.x,
threadIdx.y,
blockDim.x)]
*
deltas[lex_index_2D(
threadIdx.x+1,
threadIdx.y+1,
blockDim.x)];
// xc_update<order>(virtual_pixels, delta2, mean);
}
}
if (within_inner_block){
for (int virtual_idx = 0; virtual_idx < order*order; ++virtual_idx) {
raw_sofi_data[lex_index_2D(x_corrected*order + virtual_idx % order,
y_corrected*order + (int)floorf(virtual_idx / order),
width*order)]=virtual_pixels[virtual_idx];
}
}
}
}
From what I can see, there could be a hazard in your application between loop iterations. The write to deltas[lex_index_block]
for loop iteration frame_idx+1
could be mapped to the same location as the read of deltas[lex_index_2D(threadIdx.x, threadIdx.y -1, blockDim.x)]
in a different thread at iteration frame_idx
. The two accesses are unordered and the result is nondeterministic. Try running the app with cuda-memcheck --tool racecheck
.