Is it coalesced, if n < warpSize
?
// In kernel
int x;
if (threadId < n)
x = globalMem[threadId];
Such situation appers in the last iteration of the cycle, if some N
is indivisible by warpSize
. Should I run about these sitatuations and alloc device memory only divisible by warpSize
or it's coalesced as it is?
If threadId
is computed correctly as documented in cuda programming guide - thread hierachy, than this access will be coalesced - this will be the case for threadId = threadIdx.x
.
For the different compute architectures memory coalescing differs slightly. More details can be found at appendix G of cuda programming guide.
In general you can say that global memory accesses are coalesced, if your threads grab consecutive elements in the memory, starting from the address of the element that your first thread access.
Let's assume you have a float array.
float array[]
and your memory acces looks in that way
array[threadIdx.x == 0, threadId.x == 1, threadIdx.x == 2, ..., threadIdx.x == 31]
than your access will be coalesed.
But if you access memory in that way (interleaved for example)
array[threadIdx.x == 0, NONE, threadId.x == 1, NONE, threadIdx.x == 2, ..., NONE, threadIdx.x == 31]
than your access isn't coalesced (NONE
means that this array element isn't access by any thread)
In the first case you grab 128 consecutive bytes memory. In the second case you grab 256 bytes. For the second case two warps are required to load the memory from global memory instead of one warp for the first case. But in both cases only 32 float elements (that are 128 bytes) are required for the following computations. So your global load-rate will drop from 1.0 to 0.5 in that simple case.