
Element-wise subtraction and multiplication of an array in a CUDA kernel


I recently ran into an issue with CUDA programming.

I have an array a, and I want to do an inter-element subtraction, like a[0]-a[2], a[1]-a[3], and so on. Then I need to multiply these results pairwise, in other words: (a[0]-a[2])*(a[1]-a[3]), (a[4]-a[6])*(a[5]-a[7]), and so on. All of the above should happen in GPU kernel(s).
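
For reference, here is the whole computation as a plain serial loop (one output per group of 4 input shorts; this ignores the intermediate truncation to short that my in-place kernels below perform):

for (__int64 g = 0; g < numElements / 4; ++g)
    out[g] = (a[4 * g] - a[4 * g + 2]) * (a[4 * g + 1] - a[4 * g + 3]);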

So far, my kernels give me the correct result, and look like this:

__global__ void subtractKernel(short* a, __int64 numElements)
{
    // Grid-stride loop: each thread handles one group of 4 shorts per step.
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
#pragma unroll
    for (int i = index; i < numElements / 4; i += stride)
    {
        // In-place differences: a[4g] -= a[4g+2] and a[4g+1] -= a[4g+3]
        a[i * 4] = (a[i * 4] - a[i * 4 + 2]);
        a[i * 4 + 1] = (a[i * 4 + 1] - a[i * 4 + 3]);
    }
}

__global__ void multiplyKernel(short* a, int* dev_a, __int64 numElements)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
#pragma unroll
    for (int i = index; i < numElements / 4; i += stride)
    {
        // Product of the two differences computed by subtractKernel
        dev_a[i] = (int)a[i * 4] * (int)a[i * 4 + 1];
    }
}

However, the efficiency is poor. There is too much memory traffic in subtractKernel, and compute throughput is only 3.15% (whereas memory throughput is 47%). I know I should probably use shared memory for this task, but I don't know how to do it. Could anyone help me with this, or suggest any other approach? Thanks.

It's not a reduction-type problem, so I don't think I need a full warp-level reduction. Since I am new to CUDA, I don't know how to adapt that notion to my case.


Solution

  • It is normal to get very low utilization of the compute units and high utilization of the memory bandwidth, because this task has a very low arithmetic intensity (1 integer operation per two 16-bit items loaded, i.e. 0.25 op/byte, not counting the stores). GPUs are optimized for a much higher arithmetic intensity, typically well above 10. You can still benefit from their high memory bandwidth, though.

    However, not saturating the RAM is an issue. This is almost certainly due to the memory accesses, which are not coalesced here. You can load data with coalesced accesses and then shuffle items within each warp before computing the subtraction. However, half of the threads must be disabled for the stores using basic conditions (and possibly for the subtraction too, but I think that does not matter). This should be enough to saturate memory. I do not think shared memory is useful here given this trick (it could be used to perform the same exchange, but it should be less efficient, because shared-memory accesses are more expensive than register accesses).

    Using two kernels is inefficient here: it forces you to store data to memory and then read it back, all with an inefficient memory access pattern. It is better to do the multiplication in the same kernel, again thanks to shuffles (and conditions to disable threads), as sketched below. Only 1/4 of the threads will do work in the second part, which is not great, but it is far better than writing data to memory and reading it back. You should avoid reading/storing data in memory on a GPU as much as possible if you want your kernel to be fast. In fact, the same tends to be true on a CPU too.

    Warp shuffle functions are detailed in the CUDA manual.
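
    Below is a minimal sketch of the fused approach (the kernel name and launch parameters are illustrative, not from the question): each thread loads one 16-bit item with a coalesced access, __shfl_down_sync exchanges values within the warp, and only one lane in four stores a product. For simplicity it assumes numElements is a multiple of the total thread count (so whole warps stay converged for the shuffles) and that blockDim.x is a multiple of 4 (so groups of 4 items line up with lanes); real code needs tail handling. Note that differences are kept in 32-bit registers here, whereas the original code truncated them back to short before multiplying.

__global__ void fusedKernel(const short* __restrict__ a,
                            int* __restrict__ out,
                            long long numElements)
{
    long long index = (long long)blockDim.x * blockIdx.x + threadIdx.x;
    long long stride = (long long)blockDim.x * gridDim.x;

    for (long long i = index; i < numElements; i += stride)
    {
        // Coalesced load: consecutive lanes read consecutive 16-bit items.
        int v = a[i];

        // a[i] - a[i+2]: take a[i+2] from the lane two slots over. Groups of
        // 4 never straddle a warp (32 is a multiple of 4), so no shuffle that
        // feeds a kept result crosses a warp boundary.
        int diff = v - __shfl_down_sync(0xffffffffu, v, 2);

        // (a[4g]-a[4g+2]) * (a[4g+1]-a[4g+3]): take the neighbouring diff.
        int prod = diff * __shfl_down_sync(0xffffffffu, diff, 1);

        // Only 1 lane in 4 holds a valid product and stores it.
        if ((threadIdx.x & 3) == 0)
            out[i / 4] = prod;
    }
}

    A launch like fusedKernel<<<numBlocks, 256>>>(dev_in, dev_out, n); then replaces both subtractKernel and multiplyKernel, and every global-memory access is coalesced.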