CUDA: accumulate lines of an image

I have to accumulate the lines of an image stored in an array efficiently.

I've come up with a really naive solution, but I'm pretty sure there's a much better way to do it with CUDA.

__global__
void Accumulate(double *x, double *y, int height, int width)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i >= width)
        return;
    for (int j = 0; j < height; j++)
    {
        y[i] += x[j*width+ i];
    }

}

What would be an elegant solution to this problem?

Solution

General Tips

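__global__
void Accumulate(const float *x, float *y, int height, int width)
{
    // One thread per column of the image.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i >= width) return;

    // Accumulate in a local variable instead of in global memory.
    float sum = 0.0f;
    for (int j = 0; j < height; j++)
    {
        // Row-major indexing, as in the question: row j, column i.
        // Neighbouring threads read neighbouring addresses.
        sum += x[j * width + i];
    }
    y[i] = sum;   // single write to global memory
}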

This is a suggestion of how the code could look after applying some improvements.

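Replaced double with float. Most consumer graphics cards have far lower double-precision throughput than single-precision throughput, and float also halves the amount of data that has to be read from memory. This is only an option if you do not really need double precision, but often that is the case.
Accumulate into a temporary variable sum and write it to the array y, which lives in global memory, only once at the end. Accesses to global memory are very expensive, so avoid them whenever possible. As a side effect, y no longer has to be zeroed before the kernel is launched.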

These are a few tips that might help you to speed up your code on the GPU.
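
To check that the kernel still computes the same sums as before, it can be compared against a plain CPU loop. The following is only a minimal sketch: the helper names accumulateCpu and accumulateGpu and the block size of 256 are my own choices for illustration, not something from the question, and error handling is reduced to a single check.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same kernel as above: one thread per column, row-major image.
__global__ void Accumulate(const float *x, float *y, int height, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width) return;

    float sum = 0.0f;
    for (int j = 0; j < height; j++)
        sum += x[j * width + i];
    y[i] = sum;
}

// CPU reference (hypothetical helper), used only to verify the GPU result.
std::vector<float> accumulateCpu(const std::vector<float> &x, int height, int width)
{
    std::vector<float> y(width, 0.0f);
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            y[i] += x[j * width + i];
    return y;
}

// Host wrapper (hypothetical helper): copy in, launch, copy out.
std::vector<float> accumulateGpu(const std::vector<float> &x, int height, int width)
{
    float *dX = nullptr, *dY = nullptr;
    cudaMalloc(&dX, x.size() * sizeof(float));
    cudaMalloc(&dY, width * sizeof(float));
    cudaMemcpy(dX, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int blockSize = 256;                                 // assumed, a multiple of 32
    const int gridSize = (width + blockSize - 1) / blockSize;  // enough blocks to cover every column
    Accumulate<<<gridSize, blockSize>>>(dX, dY, height, width);

    std::vector<float> y(width);
    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(y.data(), dY, width * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(dX);
    cudaFree(dY);
    return y;
}
```

Calling accumulateGpu and accumulateCpu on the same image and comparing the two vectors element by element is a quick sanity check. Both versions add the rows in the same order, so with small integer-valued test data the results should match exactly; with arbitrary float data a small tolerance is the safer comparison.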

Finding the Optimal Block Size

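The block size also has an impact on performance. There is no single value that is optimal for every kernel and every card, so it is worth measuring a few candidates. However, the total number of threads in one block should always be a multiple of 32, the size of one warp, because otherwise the last warp of each block runs with some of its threads idle.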
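
As a starting point for such measurements, the launch configuration could be computed along these lines. This is a sketch under assumptions: roundUpToWarp, gridSizeFor and suggestBlockSize are hypothetical helper names, and the fallback of 256 threads is an arbitrary choice. cudaOccupancyMaxPotentialBlockSize is a CUDA runtime function that proposes a block size based on occupancy, which is a heuristic and not a guarantee of the fastest configuration.

```cuda
#include <cuda_runtime.h>

// Kernel from the General Tips section, repeated so the sketch is self-contained.
__global__ void Accumulate(const float *x, float *y, int height, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width) return;

    float sum = 0.0f;
    for (int j = 0; j < height; j++)
        sum += x[j * width + i];
    y[i] = sum;
}

// Round a requested thread count up to the next multiple of the warp size (32).
int roundUpToWarp(int threads)
{
    return ((threads + 31) / 32) * 32;
}

// Number of blocks needed so that gridSize * blockSize covers n elements.
int gridSizeFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Ask the runtime for an occupancy-based block size for this kernel.
// Falls back to 256 (an assumed default) if the query fails.
int suggestBlockSize()
{
    int minGridSize = 0;
    int blockSize = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, Accumulate, 0, 0);
    return (err == cudaSuccess && blockSize > 0) ? blockSize : 256;
}
```

A launch would then look like Accumulate<<<gridSizeFor(width, blockSize), blockSize>>>(x, y, height, width) with blockSize taken from suggestBlockSize() or from your own timing runs. The bounds check at the top of the kernel is what makes it safe for the grid to cover more threads than there are columns.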