Tags: c, cuda, gpu, gpu-shared-memory

Summation over one dimension of a three dimensional array using shared memory


I need to do calculation like:

A[x][y] = sum{z=0 .. n-1} ( B[x][y][z] + C[x][y][z] )

where matrix A has dimensions [height][width] and tensors B and C have dimensions [height][width][n].

Values are mapped to memory with something like:

int index = 0;
for (int z = 0; z < n; ++z)
    for (int y = 0; y < width; ++y)
        for (int x = 0; x < height; ++x) {
            matrix[index] = value;   // index = x + y*height + z*height*width
            index++;
        }

I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races I use atomicAdd, something like this:

Part of host code:

dim3 block (n, 1, 1);
dim3 grid (height, width, 1);

Kernel:

atomicAdd( &(A[blockIdx.x + blockIdx.y*gridDim.y]), 
           B[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y] 
           + C[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y] );

I would like to use shared memory for calculating the sum and then copy this result to global memory.

I am not sure how to do the part with shared memory. Each block's shared memory will hold just one number (the sum result). How should I copy this number to the right place in matrix A in global memory?


Solution

  • You probably don't need shared memory or atomic memory access to do the summation you are asking about. If I have understood this correctly, your data is in column-major order, so the logical operation is to have one thread per entry of the output matrix, and have each thread traverse the z axis of the input tensors, summing as it goes. The kernel for this could look something like:

    __global__ void kernel(float *A, const float *B, const float *C, 
            const int width, const int height, const int n)
    {
        // One thread per output entry A[x][y]
        int tidx = threadIdx.x + blockDim.x * blockIdx.x;  // x (row) index
        int tidy = threadIdx.y + blockDim.y * blockIdx.y;  // y (column) index

        if ( (tidx < height) && (tidy < width) ) {
            int stride = width * height;      // distance between consecutive z slices
            int ipos = tidx + tidy * height;  // column-major offset of (x, y, 0)

            float * oval = A + ipos;
            float sum = 0.f;
            for(int z = 0; z < n; z++, ipos += stride) {
                sum += B[ipos] + C[ipos];
            }
            *oval = sum;
        }
    }
    

    This approach should be optimal for column-major data with width * height >= n. There is no performance advantage to using shared memory for this, and no need for atomic memory operations either. If you had a problem where width * height << n, it might make sense to try a block-wise parallel reduction per sum. But you have not indicated what the typical dimensions of the problem are. Leave a comment if your problem is more like the latter, and I can add a reduction based sample kernel to the answer.
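    If it helps to validate the kernel's output, a plain host-side reference of the same summation is straightforward (a sketch; the function name `sum_over_z` is illustrative, not part of the CUDA code above):

    ```c
    /* CPU reference: A[x][y] = sum over z of B[x][y][z] + C[x][y][z],
       with all arrays in the column-major layout described in the question. */
    static void sum_over_z(float *A, const float *B, const float *C,
                           int width, int height, int n)
    {
        int stride = width * height;         /* size of one z slice */
        for (int y = 0; y < width; ++y) {
            for (int x = 0; x < height; ++x) {
                int ipos = x + y * height;   /* offset of (x, y, 0) */
                float sum = 0.f;
                for (int z = 0; z < n; ++z, ipos += stride)
                    sum += B[ipos] + C[ipos];
                A[x + y * height] = sum;
            }
        }
    }
    ```

    Comparing this element-wise against the device result copied back with cudaMemcpy is a quick sanity check for the indexing.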