
CUDA coalescing and global memory


I have been told in my CUDA course that the following global-memory access is coalesced if the elements of my "a" array have a size of 4, 8, or 16 bytes.

int i = blockIdx.x*blockDim.x + threadIdx.x;
a[i];

The two conditions for coalescing are: the threads of the warp must access one contiguous chunk of 32, 64, or 128 bytes, and the warp's first thread must access an address that is a multiple of 32, 64, or 128.

But in this example (first condition), nothing guarantees that the warp will access a chunk of 32 bytes.

If I assume that a's elements are floats (4 bytes), and if I define blockDim.x as 5, then every warp will access a chunk of 20 (4 × 5) bytes, even though the elements of my "a" array have a size of 4, 8, or 16 bytes, and not 32. So, is the very first claim about coalescing false?

Thank you for your answer.


Solution

  • But in this example (first condition), nothing guarantees that the warp will access a chunk of 32 bytes.

    Because of thread ordering, each warp is guaranteed to access a contiguous 128-byte chunk (32 threads × 4 bytes), which is a necessary condition for coalesced memory access.
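    You can check this arithmetic on the host. The sketch below (plain C++, with a hypothetical helper `segmentsTouched`) counts how many 128-byte segments the 32 addresses of one warp fall into for the `a[i]` pattern from the question; `cudaMalloc` returns pointers aligned to at least 256 bytes, so an aligned base is the realistic case.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <set>

    // Count how many 128-byte segments a warp of 32 threads touches when
    // thread t reads element (warpBase + t) of an array whose elements are
    // elemSize bytes. baseAddr is the byte address of element 0.
    int segmentsTouched(std::uintptr_t baseAddr, int warpBase, int elemSize) {
        std::set<std::uintptr_t> segments;
        for (int t = 0; t < 32; ++t) {
            std::uintptr_t addr = baseAddr + (std::uintptr_t)(warpBase + t) * elemSize;
            segments.insert(addr / 128);  // which 128-byte segment this falls in
        }
        return (int)segments.size();
    }

    int main() {
        // Warp reading a[0..31] (4-byte floats) from an aligned base:
        // 32 x 4 = 128 bytes in exactly one segment -> fully coalesced.
        assert(segmentsTouched(0, 0, 4) == 1);
        // The next warp (a[32..63]) also covers exactly one segment.
        assert(segmentsTouched(0, 32, 4) == 1);
        // A base misaligned by 4 bytes splits the access across two segments.
        assert(segmentsTouched(4, 0, 4) == 2);
        return 0;
    }
    ```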

    If I assume that a's elements are floats (4 bytes), and if I define blockDim.x as 5, then every warp will access a chunk of 20 (4 × 5) bytes, even though the elements of my "a" array have a size of 4, 8, or 16 bytes, and not 32.

    Warps always contain 32 threads. If you define blockDim.x as 5, each block will consist of one warp with 27 inactive threads. The coalescing rules still apply and the transaction will be coalesced, but you are wasting 27/32 of your potential computational capacity and memory bandwidth.
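    The rounding-up of a block into warps can be sketched as plain host arithmetic (the helper name `warpsPerBlock` is illustrative):

    ```cpp
    #include <cassert>

    // Blocks are carved into warps of 32 threads, rounding up; a partially
    // filled warp still occupies a full warp's scheduling slot.
    int warpsPerBlock(int blockDim) {
        return (blockDim + 31) / 32;
    }

    int main() {
        assert(warpsPerBlock(5) == 1);   // one warp: 5 active + 27 inactive lanes
        assert(warpsPerBlock(32) == 1);  // exactly one full warp
        assert(warpsPerBlock(33) == 2);  // one extra thread spills into a 2nd warp
        // With blockDim.x = 5 the single warp still issues one transaction,
        // but only 5 * 4 = 20 of the 128 bytes fetched serve live threads.
        return 0;
    }
    ```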

    So, is the very first claim about coalescing false?

    No.
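
    Putting it together, a minimal sketch of the access pattern from the question (kernel and buffer names are illustrative): consecutive threads read consecutive 4-byte elements, so each full warp reads one contiguous, aligned 128-byte chunk. Choosing a block size that is a multiple of 32 keeps every warp full.

    ```cuda
    #include <cuda_runtime.h>

    // Consecutive threads touch consecutive floats: per-warp accesses
    // coalesce into single 128-byte transactions (cudaMalloc'd pointers
    // are aligned to at least 256 bytes, so the alignment condition holds).
    __global__ void copyCoalesced(const float* a, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard so the last, partial block stays in bounds
            out[i] = a[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *out;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        // 256 threads per block (a multiple of 32) keeps every warp full.
        copyCoalesced<<<(n + 255) / 256, 256>>>(a, out, n);
        cudaDeviceSynchronize();
        cudaFree(a);
        cudaFree(out);
        return 0;
    }
    ```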