CUDA atomicAdd failed

The following CUDA kernel is supposed to do image slices addition for an 3D image, i.e., you collapse the 3D volume along one dimension and produce one 2D image through doing pixel-wise additions. The image_in data pointer has size 128 * 128 * 128, which was obtained from an ITK::Image using the function GetOutputBuffer(). After reading the ITK documentation, I think we can safely assume that the data pointer points to an segment of continuous memory of the image data, without padding. The image_out is just a 2D image of size 128 * 128, also produced from an ITK::Image. I included the info about the images just for completeness but the question is more about CUDA atomic and might be very elementary. The code compute the thread id first and project the id into the range of 128 * 128, which means all pixels in the same line along the dimension we perform addition will have the same idx. Then using this idx, atomicAdd was used to update the image_out.

__global__ void add_slices(int* image_in, int* image_out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int idx = tid % (128 * 128);
    int temp = image_in[tid];

    atomicAdd( &image_out[idx], temp );

}

The way I initialized the image_out is through the following, there are two ways I tried with the similar results:

int* image_out = new int[128 * 128];
for (...) {
    /* assign image_out to zeros */
}

and the one using ITK interface:

out_image->SetRegions(region2d);
out_image->Allocate();
out_image->FillBuffer(0);
// Obtain the data buffer
int* image_out = out_image->GetOutputBuffer();

Then I setup CUDA as the following:

unsigned int size_in = 128 * 128 * 128;
unsigned int size_out = 128 * 128;
int *dev_in;
int *dev_out;
cudaMalloc( (void**)&dev_in, size_in * sizeof(int) );
cudaMalloc( (void**)&dev_out, size_out * sizeof(int));
cudaMemcpy( dev_in, image_in, size_in * sizeof(int), cudaMemcpyHostToDevice );
add_slices<<<size_in/64, 64 >>>(dev_in, dev_out);
cudaMemcpy( image_out, dev_out, size_out * sizeof(int), cudaMemcpyDeviceToHost);

Is there any problem to the above code? The reason why I am seeking help here comes from the frastration that the above code sometimes might produce the right result (once every 50 times I run the code, maybe, I swear I have seen the correct result at least twice), while the rest of the time just produced some garbages. Does the issue comes from the atomicAdd() function? At the beginning my image type was of double, which CUDA doesn't support atomicAdd(double*, double) so I used the code provided by Nvidia as the following

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                                          (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, 
                        __double_as_longlong(val + 
                        __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

Then just for testing purpose I switched all my image to int then the situation was still the same that most of the time garbages while once in a blue moon correct result.

Do I need to turn on some compiling flag? I am using CMAKE to build the project using

find_package(CUDA QUIET REQUIRED)

for the CUDA support. The following is the way I setup the CUDA_NVCC_FLAGS

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -arch=sm_30"),

maybe I missed something?

Any suggestion will be greatly appreciated and I will update the question if more info of the code is needed.

Solution

So it turns out that the solution to this problem is adding the following line to initialize the memory pointed by dev_out.

cudaMemcpy( dev_out, image_out, size_out * sizeof(int), cudaMemcpyHostToDevice );

I forgot to initialize it since I was thinking that it is a output variable and I initialized it on the host.

Just like that talonmies said, it has nothing to do with atomicAdd at all. Both int version and double version of atomicAdd works perfectly. Just remember to initialize your variable on device.