Tags: cuda, gpu, nvidia, gpgpu, gpu-shared-memory

Does CUDA broadcast shared memory to all threads in a block without a bank conflict?


In the CUDA programming guide, in the shared memory section, it states that shared memory reads by a warp from the same address are not serialized but broadcast.

However, it doesn't state what happens if the entire block requests the same memory address. Are the accesses between warps serialized, or can CUDA broadcast to the whole block?

Demo code for my case

// Assume blockDim.x == 1024 and 1024-sized int arrays
__global__ void add_from_shared(int* i, int* j, int* out)
{
    __shared__ int shmem[1024];
    shmem[threadIdx.x] = i[threadIdx.x];
    __syncthreads();  // make the shared writes visible to the whole block
    // ... do some stuff ...
    // Is the shared memory read here serialized between warps, or is it a broadcast over the entire block?
    j[threadIdx.x] += shmem[0];
}

Thanks


Solution

  • Shared memory bank conflicts are only relevant for threads within a warp, on a particular instruction/cycle. All instructions in the GPU are issued warp-wide. They are not issued to all warps in a threadblock, from a single warp scheduler, in the same cycle.

    There is no such concept as shared memory bank conflicts between threads in different warps, nor is there any concept of shared memory bank conflicts between threads that are executing different issued instructions.

    The warp scheduler will issue the shared read instructions (LDS) to each warp individually. Depending on the access pattern evident among threads in that warp, for that issued instruction, bank conflicts may or may not occur. There are no bank conflicts possible between threads of one warp and threads of another warp.

    There is likewise no broadcast mechanism that extends beyond a warp.

    All instructions in the GPU are issued per warp.

    If all threads in a block read the same address, then the warp scheduler will issue that instruction to one warp, and for the threads in that warp, broadcast will apply. At the same time or at a different time, from the same warp scheduler or a different warp scheduler, the same instruction (i.e. from the same point in the instruction stream) will be issued to another warp. Broadcast will happen within that request. Repeat for as many warps as there are in the threadblock. (The first sketch after this answer demonstrates this with a rough timing comparison.)

    Your code doesn't contain atomics or shared memory writes to the same location, and almost nothing I've said here pertains to atomics. Atomics are either warp-aggregated or serialized by the atomic handling mechanism, and multiple (non-atomic) writes to the same location lead to undefined behavior: you can expect that one of the writes will show up in that location, but which one is undefined. From a performance perspective, I don't know of any statements about same-location shared-write performance; performance-wise, atomics are a completely different animal. (The second sketch below illustrates these write-side cases.)
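
To see the per-warp broadcast in practice, here is a minimal timing sketch (my own illustration, not from the original answer; the kernel names are made up, and clock64() timing of a single load is coarse and subject to compiler reordering, so treat the numbers as indicative only). It compares a whole-block read of shmem[0] against a conflict-free per-thread read:

#include <cstdio>

// Every thread in the block reads the same shared address.
// Within each warp the read is a broadcast; the warps themselves
// are issued one after another, as with any instruction.
__global__ void read_same_address(const int* in, int* out, long long* cycles)
{
    __shared__ int shmem[1024];
    shmem[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    long long start = clock64();
    int v = shmem[0];              // broadcast within each warp
    long long stop = clock64();
    out[threadIdx.x] = v;
    if (threadIdx.x == 0) *cycles = stop - start;
}

// Reference case: each thread reads its own bank (no conflicts, no broadcast).
__global__ void read_own_address(const int* in, int* out, long long* cycles)
{
    __shared__ int shmem[1024];
    shmem[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    long long start = clock64();
    int v = shmem[threadIdx.x];    // one bank per thread
    long long stop = clock64();
    out[threadIdx.x] = v;
    if (threadIdx.x == 0) *cycles = stop - start;
}

int main()
{
    int *d_in, *d_out;
    long long *d_cycles, c;
    cudaMalloc(&d_in, 1024 * sizeof(int));
    cudaMalloc(&d_out, 1024 * sizeof(int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemset(d_in, 0, 1024 * sizeof(int));

    read_same_address<<<1, 1024>>>(d_in, d_out, d_cycles);
    cudaMemcpy(&c, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("same-address read: %lld cycles\n", c);

    read_own_address<<<1, 1024>>>(d_in, d_out, d_cycles);
    cudaMemcpy(&c, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("per-thread read:   %lld cycles\n", c);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_cycles);
    return 0;
}

If both kernels report similar cycle counts, that is consistent with the answer: the same-address read is broadcast within each warp and simply issued once per warp, not serialized thread-by-thread across the block.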
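
And a sketch of the write-side behavior mentioned in the last paragraph (again my own illustration, with a made-up kernel name): a plain same-location shared write leaves which value lands undefined, while atomicAdd gives a deterministic result because the updates are handled atomically:

#include <cstdio>

__global__ void same_location_writes(int* out_plain, int* out_atomic)
{
    __shared__ int plain;
    __shared__ int accum;
    if (threadIdx.x == 0) { plain = 0; accum = 0; }
    __syncthreads();

    plain = threadIdx.x;    // every thread writes one location: exactly one
                            // write shows up, but which one is undefined
    atomicAdd(&accum, 1);   // well-defined: warp-aggregated or serialized
    __syncthreads();

    if (threadIdx.x == 0) {
        *out_plain  = plain;   // some thread's index, unspecified which
        *out_atomic = accum;   // equals blockDim.x (256 here)
    }
}

int main()
{
    int *d, h[2];
    cudaMalloc(&d, 2 * sizeof(int));
    same_location_writes<<<1, 256>>>(d, d + 1);
    cudaMemcpy(h, d, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("plain write winner: %d, atomic sum: %d\n", h[0], h[1]);
    cudaFree(d);
    return 0;
}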