Suppose I have a kernel which performs strided memory access as follows:
__global__ void strideExample(float *outputData, float *inputData, int stride = 2)
{
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}
I understand that accesses with a stride size of 2 will result in a 50% load/store efficiency, since half of the elements involved in the transaction are not used (becoming wasted bandwidth). How do we proceed to calculate the load/store efficiency for larger stride sizes? Thanks in advance!
In general:
load efficiency = requested loads / effective loads
where requested loads is the number of bytes the software asked to read and effective loads is the number of bytes the hardware actually had to read. The same formula applies to stores.
Perfectly coalesced accesses have an efficiency of 1.
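Written as code, just to pin down the definition (the function name here is my own, not a CUDA API):

#include <cstddef>

// Efficiency is the ratio of bytes the software asked for to the bytes
// the hardware actually moved in whole transactions.
static double loadEfficiency(std::size_t requestedBytes, std::size_t fetchedBytes)
{
    return static_cast<double>(requestedBytes) / static_cast<double>(fetchedBytes);
}

For a fully coalesced warp, 32 threads * 4 bytes = 128 bytes requested are served by a single 128-byte transaction, so loadEfficiency(128, 128) == 1.0.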
Your code requests gridDim.x * blockDim.x * sizeof(float) bytes in total (each thread asks for a single 4-byte float). Assuming inputData and outputData are properly aligned (as pointers returned by cudaMalloc are), the hardware actually has to move about gridDim.x * blockDim.x * sizeof(float) * stride bytes, rounded up to the transaction size (128 bytes between the SM and L1, 32 bytes between L1 and L2), because the requested floats are spread over stride times as many transactions.
Assuming you launch enough threads, the rounding up to the transaction size becomes negligible and you can simplify the equation to just 1 / stride (this holds as long as stride * sizeof(float) stays below the transaction size). For your stride of 2 that reproduces the 50% you quoted; a stride of 6, for example, would give a load efficiency of roughly 16.7%.
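If you want to check the number against the hardware rather than on paper, the profiler reports these ratios directly; on older toolkits nvprof exposes them as the gld_efficiency and gst_efficiency metrics (Nsight Compute has equivalent sector-based metrics).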