Suppose I have a kernel which performs strided memory access as follows:
__global__ void strideExample(float *outputData, float *inputData, int stride = 2)
{
    int index = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    outputData[index] = inputData[index];
}
I understand that accesses with a stride size of 2 will result in a 50% load/store efficiency, since half of the elements involved in the transaction are not used (becoming wasted bandwidth). How do we proceed to calculate the load/store efficiency for larger stride sizes? Thanks in advance!
In general:
load efficiency = requested loads / effective loads
where requested loads is the number of bytes the software asked to read and effective loads is the number of bytes the hardware actually had to read. The same formula applies to stores.
Perfectly coalesced accesses have an efficiency of 1.
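Written as code, just to pin down the definition (the function name here is my own, not a CUDA API):

#include <cstddef>

// Efficiency is the ratio of bytes the software asked for to the bytes
// the hardware actually moved in whole transactions.
static double loadEfficiency(std::size_t requestedBytes, std::size_t fetchedBytes)
{
    return static_cast<double>(requestedBytes) / static_cast<double>(fetchedBytes);
}

For a fully coalesced warp, 32 threads * 4 bytes = 128 bytes requested are served by a single 128-byte transaction, so loadEfficiency(128, 128) == 1.0.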
Your code requests gridDim.x * blockDim.x * sizeof(float) bytes in total (each thread asks for a single 4-byte float). Assuming inputData and outputData are properly aligned (as pointers returned by cudaMalloc are), the hardware actually has to move about gridDim.x * blockDim.x * sizeof(float) * stride bytes, rounded up to the transaction size (128 bytes between the SM and L1, 32 bytes between L1 and L2), because the requested floats are spread over stride times as many transactions.
Assuming you launch enough threads, the rounding up to the transaction size becomes negligible and you can simplify the equation to just 1 / stride (this holds as long as stride * sizeof(float) stays below the transaction size). For your stride of 2 that reproduces the 50% you quoted; a stride of 6, for example, would give a load efficiency of roughly 16.7%.
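If you want to check the number against the hardware rather than on paper, the profiler reports these ratios directly; on older toolkits nvprof exposes them as the gld_efficiency and gst_efficiency metrics (Nsight Compute has equivalent sector-based metrics).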