CUDA memory bank conflict

I would like to be sure that I correctly understand bank conflicts in shared memory.

I have 32 segments of data.

These segments consist of 128 integers each.

[[0, 1, ..., 126, 127], [128, 129, ..., 255], ..., [3968, 3969, ..., 4095]]

Each thread in a warp accesses only its own portion.

Thread 0 accesses position 0 of portion 0 corresponding to index 0.
Thread 1 accesses position 0 of portion 1 corresponding to index 128.
...
Thread 31 accesses position 0 of portion 31 corresponding to index 3968.

Does it mean that I have a 32-fold bank conflict?

If yes, then if I add one element of padding to each segment (i.e. 129 elements total), then each thread will access a unique bank. Am I right?

Solution

Yes, you will have 32-way bank conflicts. For the purposes of bank conflicts, it may help to visualize shared memory as a two-dimensional array whose width is 32 elements (e.g. 32 int or float quantities). Each column in this 2D array is a "bank", so the element at linear index i sits in bank i % 32.

Overlay your storage pattern on that. When you do so, you will see that with your stated access pattern, all threads in the warp request items from column 0: every portion starts at a multiple of 128, and 128 is itself a multiple of 32.
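To make the overlay concrete, here is a minimal host-side sketch of the bank arithmetic in plain C++ (no GPU needed). It assumes 32 banks of one 4-byte word each and 32 threads per warp; the names `bank_of` and `conflict_degree` are illustrative helpers, not part of any CUDA API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumptions: 32 banks, each one 4-byte word wide, and 32 threads per warp.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// Bank (column) that holds the word at the given linear index.
int bank_of(int word_index) {
    assert(word_index >= 0);
    return word_index % kNumBanks;
}

// Largest number of threads in one warp that land in the same bank
// when thread t reads the word at index t * stride + offset.
// All indices are distinct for stride > 0, so this is the conflict degree.
int conflict_degree(int stride, int offset = 0) {
    assert(stride > 0);
    std::array<int, kNumBanks> hits{};  // zero-initialized hit counter per bank
    for (int t = 0; t < kWarpSize; ++t) {
        ++hits[bank_of(t * stride + offset)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With the layout from the question, `conflict_degree(128)` evaluates to 32, because every index 0, 128, ..., 3968 maps to bank 0. A result of 1 would mean the access is conflict-free.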

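Yes, the usual "trick" here is to pad the storage by 1 element per "row" (in your case, one element per "portion"). With 129 elements per portion, portion t starts at index 129 * t, which falls in bank t % 32, so the 32 threads hit 32 different banks. That should eliminate bank conflicts for your stated access pattern.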
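The effect of the padding can be checked with the same kind of index arithmetic. The following is a host-side sketch in plain C++ under the assumption of 32 banks of one 4-byte word each; `padded_index` and `all_banks_distinct` are hypothetical helper names chosen for this illustration.

```cpp
#include <array>
#include <cassert>

// Assumptions: 32 banks of one 4-byte word each, 32 threads per warp,
// and 128 payload elements per portion plus 1 unused padding element.
constexpr int kBanks = 32;
constexpr int kThreads = 32;
constexpr int kPortionLength = 128;
constexpr int kPadding = 1;

// Linear index of element `position` in portion `portion` of the padded layout.
int padded_index(int portion, int position) {
    assert(portion >= 0 && position >= 0 && position < kPortionLength);
    return portion * (kPortionLength + kPadding) + position;
}

// True if, when thread t reads `position` of portion t, no two threads
// of the warp touch the same bank.
bool all_banks_distinct(int position) {
    std::array<bool, kBanks> used{};  // all false initially
    for (int t = 0; t < kThreads; ++t) {
        int bank = padded_index(t, position) % kBanks;
        if (used[bank]) {
            return false;  // two threads share a bank: conflict
        }
        used[bank] = true;
    }
    return true;
}
```

Here `all_banks_distinct(p)` holds for every position p from 0 to 127, since thread t lands in bank (t + p) % 32. In a kernel, the corresponding change would be to size the shared array as 32 * 129 elements and index it as portion * 129 + position.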