Why does a thread accessing two contiguous elements cause a "bank conflict"?

[Figure: access pattern that causes a bank conflict]

As shown in the red box above, I don't understand why having each thread access two consecutive array elements causes a bank conflict, while the access pattern shown below does not.

[Figure: access pattern with no bank conflict]

Thanks for your answers!

Solution

https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

Shared memory bank conflicts

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.
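To make the bank mapping concrete, here is a small Python sketch (a model, not CUDA code, and not taken from the blog post) that computes which bank serves a given byte address and how many distinct banks a request spans, i.e. the b in the paragraph above. It assumes 32 banks, each 4 bytes wide, with successive 4-byte words assigned to successive banks; `bank_of` and `distinct_banks` are hypothetical helper names.

```python
# Model of shared-memory bank mapping.
# Assumptions: 32 banks, each 4 bytes wide, consecutive 4-byte words
# assigned to consecutive banks (wrapping around after bank 31).
NUM_BANKS = 32
BANK_WIDTH_BYTES = 4


def bank_of(byte_address):
    """Return the bank serving the 4-byte word that contains byte_address."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def distinct_banks(byte_addresses):
    """Number of distinct banks touched by one request (the 'b' in the text)."""
    return len({bank_of(a) for a in byte_addresses})


# 32 threads reading consecutive 4-byte floats: all 32 banks are used.
print(distinct_banks([4 * t for t in range(32)]))  # 32
# 32 threads reading every other float: only the 16 even banks are used.
print(distinct_banks([8 * t for t in range(32)]))  # 16
```

The more distinct banks a request spans, the more of the aggregate bandwidth it can use in a single pass.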

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
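The serialization and multicast rules can be modeled the same way. In this sketch (same assumptions: 32 banks of 4 bytes, each thread accessing a single 4-byte word; `conflict_degree` is a hypothetical name), the number of serialized transactions is the largest number of distinct words that land in any one bank, because threads reading the same word are served together.

```python
from collections import defaultdict

NUM_BANKS = 32        # assumption: 32 banks
BANK_WIDTH_BYTES = 4  # assumption: each bank is 4 bytes wide


def conflict_degree(byte_addresses):
    """Return how many serialized transactions one warp-wide request needs.

    Each thread is assumed to access a single 4-byte word. Threads that hit
    the same word are served together (broadcast/multicast), so only
    distinct words sharing a bank force extra transactions.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


print(conflict_degree([0] * 32))                      # 1: broadcast of one word
print(conflict_degree([4 * t for t in range(32)]))    # 1: consecutive words
print(conflict_degree([8 * t for t in range(32)]))    # 2: stride of 2 words
print(conflict_degree([128 * t for t in range(32)]))  # 32: every word in bank 0
```

The returned value is the factor by which effective bandwidth drops for that request, matching the "number of colliding memory requests" described above.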

Let's assume, for your parallel-reduction example, that there are 8 memory banks, each 4 bytes wide, so that 4-byte element i is served by bank i % 8. (Current NVIDIA GPUs have 32 banks; 8 just keeps the example small.)

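Then, in the first example, thread t reads element 2t, so 8 threads read elements 0, 2, 4, ..., 14, which map to banks 0, 2, 4, 6, 0, 2, 4, 6. Banks 0, 2, 4, and 6 each have to serve two requests, which the hardware serializes (a 2-way bank conflict), while banks 1, 3, 5, and 7 sit idle. The same happens on the odd banks when each thread reads its second element, 2t + 1.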

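In the second example, thread t reads element t, so the 8 threads hit banks 0 through 7 exactly once each; every bank serves only one request and the access completes without serialization.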
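The 8-bank toy model can be checked with a few lines of Python. The access patterns are assumptions reconstructed from the answer, since the original figures are not available: in the first example thread t handles the adjacent elements 2t and 2t + 1, and in the second example it handles t and t + 8. `requests_per_bank` and `worst_bank_load` are hypothetical helpers.

```python
from collections import Counter

NUM_BANKS = 8    # toy value from the answer; real GPUs use 32
NUM_THREADS = 8  # assumption: 8 threads issue each read together


def requests_per_bank(element_indices):
    """Count how many of the given 4-byte elements fall into each bank."""
    return Counter(i % NUM_BANKS for i in element_indices)


def worst_bank_load(element_indices):
    """Largest number of requests any single bank must serve (1 = no conflict)."""
    return max(requests_per_bank(element_indices).values())


threads = range(NUM_THREADS)

# First example (assumed): thread t works on the adjacent pair 2t and 2t + 1.
strided_first = [2 * t for t in threads]       # elements 0, 2, ..., 14
strided_second = [2 * t + 1 for t in threads]  # elements 1, 3, ..., 15

# Second example (assumed): thread t works on elements t and t + NUM_THREADS.
sequential_first = [t for t in threads]                 # elements 0..7
sequential_second = [t + NUM_THREADS for t in threads]  # elements 8..15

print(sorted(requests_per_bank(strided_first).items()))
# [(0, 2), (2, 2), (4, 2), (6, 2)]
print(worst_bank_load(strided_first), worst_bank_load(strided_second))        # 2 2
print(worst_bank_load(sequential_first), worst_bank_load(sequential_second))  # 1 1
```

A worst-bank load of 2 means each read is split into two serialized passes, which is the bank conflict in the first example; a load of 1 means every bank is hit once. With 32 banks and a 32-thread warp, the same stride-2 pattern still gives a 2-way conflict.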