Relevance of shared memory bank conflicts in Fermi and higher

From what I read in the CUDA documentation, shared memory bank conflicts are irrelevant on sm_20 and higher because values are broadcast when they are requested simultaneously, preventing any serialization delays.

The documentation:

The shared memory hardware is improved on devices of compute capability 2.x to support multiple broadcast words and to generate fewer bank conflicts for accesses of 8-bits, 16-bits, 64-bits, or 128-bits per thread (Section G.4.3).

Can someone confirm my assertion?

Solution

No, they are not "irrelevant".

I believe your confusion may arise from a common misconception about bank conflicts: that "bank" is somehow equal to "location". There is a relationship between bank and location, but it is not one of equality.

To take a simplified example, suppose we had 4 banks (and let's limit the discussion to 32-bit transactions and naturally aligned 32-bit storage, e.g. int or float). The relationship between banks and locations (int or float index "addresses") is as follows:

address:  bank:
       0      0 <-----------------------Thread 0
       1      1
       2      2     ------Thread 1
       3      3    /
       4      0 <---------Thread 2
       5      1
       6      2 
       7      3
       8      0 <-----------------------Thread 3
...

We see that addresses 1 and 5, for example, are in the same bank, but they are not the same location.
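The table above can be reproduced with a few lines of host-side C++. This is only a sketch of the simplified 4-bank model, assuming banks are assigned to consecutive 32-bit words in round-robin order; the names `kNumBanks`, `bank_of`, and `print_bank_table` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstdio>

// Simplified model from the table above: 4 banks, 32-bit words.
// (kNumBanks and bank_of are illustrative names, not a CUDA API.)
constexpr int kNumBanks = 4;

// Bank that holds the 32-bit word at a given non-negative word index.
constexpr int bank_of(int word_index) { return word_index % kNumBanks; }

// Print the address-to-bank table for the first `count` words.
void print_bank_table(int count) {
    std::printf("address:  bank:\n");
    for (int addr = 0; addr < count; ++addr)
        std::printf("%8d %6d\n", addr, bank_of(addr));
}
```

Calling `print_bank_table(9)` lists addresses 0 through 8, and `bank_of(1)` and `bank_of(5)` both evaluate to 1: same bank, different locations.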

Bank conflicts (on any architecture) can arise whenever two or more threads in a warp are attempting to access data in the same bank as a result of a particular warp transaction (e.g. read from shared memory).

In the pre-Fermi case, even if multiple threads read from the same location (i.e. address), this was a bank conflict, as those threads were reading from the same bank.

On compute capability 2.x and later, a broadcast mechanism was introduced. It changes only one specific case: when multiple threads read from the same location, this is no longer a bank conflict, and all threads reading from that location receive the data in the same cycle without serialization.

However, if multiple threads read from separate locations that are in the same bank, that is a bank conflict on any current GPU architecture.

In the above picture, if Thread 0 reads from location/address 0 and Thread 3 reads from location/address 8, that will always be a bank conflict on any current architecture (given that this is a simplified example with only 4 banks). If Thread 1 and Thread 2 both read from location/address 4, that is a bank conflict on pre-Fermi devices but not on Fermi devices and beyond.
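To make the two cases concrete, here is a rough host-side C++ model of the rule as described in this answer. It is a sketch under simplifying assumptions, not a description of what the hardware does internally: it only counts how many serialized passes one warp-wide read would need, and the helper name `conflict_degree` and its `broadcast` flag are invented for this example.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Number of serialized passes one warp-wide read needs under the simplified
// rule described above. word_index[t] is the (non-negative) 32-bit word that
// thread t reads.
// broadcast == true : threads reading the same word share one pass.
// broadcast == false: every thread hitting a bank costs its own pass.
// (conflict_degree is a hypothetical helper, not a CUDA API.)
int conflict_degree(const std::vector<int>& word_index, int num_banks, bool broadcast) {
    std::vector<std::set<int>> distinct(num_banks);  // distinct words per bank
    std::vector<int> hits(num_banks, 0);             // thread count per bank
    for (int w : word_index) {
        int bank = w % num_banks;
        distinct[bank].insert(w);
        ++hits[bank];
    }
    int degree = 0;
    for (int b = 0; b < num_banks; ++b) {
        int passes = broadcast ? static_cast<int>(distinct[b].size()) : hits[b];
        degree = std::max(degree, passes);
    }
    return degree;  // 1 means no conflict, 2 means a 2-way conflict, and so on
}
```

With 4 banks, `conflict_degree({0, 8}, 4, true)` returns 2 (the Thread 0 / Thread 3 case), while `conflict_degree({4, 4}, 4, true)` returns 1 and the same call with `broadcast` set to `false` returns 2 (the Thread 1 / Thread 2 case).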

For a 32-bank arrangement, which is an actual bank configuration, the bank of any location in shared memory is given by the lower 5 bits of the index or offset to that location (counted in 32-bit words), regardless of whether that location happens to belong to an int or float array.
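The "lower 5 bits" rule amounts to a mask with 31. The sketch below assumes 32 banks that are each one 32-bit word wide and an array that starts at word 0 of shared memory; `kBanks`, `bank_of_word`, and `bank_of_byte_offset` are illustrative names only.

```cpp
#include <cstdint>

constexpr std::uint32_t kBanks = 32;  // assumed: 32 banks, each one 32-bit word wide

// Bank of element `index` in an array of 32-bit elements (int or float)
// that starts at word 0 of shared memory: the lower 5 bits of the index.
constexpr std::uint32_t bank_of_word(std::uint32_t index) {
    return index & (kBanks - 1);  // same as index % 32
}

// Same mapping starting from a byte offset, assuming 4-byte-wide banks.
constexpr std::uint32_t bank_of_byte_offset(std::uint32_t byte_offset) {
    return bank_of_word(byte_offset / 4);
}
```

For example, `bank_of_word(5)` and `bank_of_word(37)` are both 5, so two threads of one warp reading elements 5 and 37 in the same request would conflict, whereas elements 5 and 6 fall in different banks.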