Tags: cuda, gpu, shared-memory, bank-conflict

Will the same thread accessing the same memory bank twice cause conflicts?


I am working on a kernel that does a vector reduction. It basically adds up all the positions in the vector and stores the result in position 0.

I'm following this scheme, with blocks of 512 float elements:

[Image: reduction scheme]

The code:

// scratch[] is an array of 512 floats located in shared memory
const int NUM_ELEMENTS = 512;
for (int stride = NUM_ELEMENTS / 2; stride >= 1; stride /= 2) {
  if (threadIdx.x < stride) {
    scratch[threadIdx.x] += scratch[threadIdx.x + stride];
  }
  __syncthreads();
}
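
For context, here is a minimal sketch of how this loop could sit inside a complete kernel. The kernel name, parameters, and launch configuration below are only illustrative, not my actual code:

__global__ void reduce512(const float *in, float *out)
{
  __shared__ float scratch[512];

  // each block loads its 512-element slice into shared memory
  scratch[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
  __syncthreads();

  for (int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
    if (threadIdx.x < stride) {
      scratch[threadIdx.x] += scratch[threadIdx.x + stride];
    }
    __syncthreads();
  }

  // position 0 of scratch now holds this block's sum
  if (threadIdx.x == 0) {
    out[blockIdx.x] = scratch[0];
  }
}

It would be launched with one thread per element, e.g. reduce512<<<numBlocks, 512>>>(d_in, d_out), where numBlocks, d_in and d_out are placeholders.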

The odd thing is, I'm expecting to get shared memory bank conflicts, but I'm not. In the first iteration, thread 0 adds up positions 0 and 256, which reside in the same bank. Thread 1 adds up positions 1 and 257, and so on.
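
For reference, the bank mapping I'm assuming is the usual one of 32 banks of 4-byte words, so a word's bank is simply its index modulo 32. A quick sketch:

// assumed mapping: 32 banks of 4-byte words => bank = word index % 32
__host__ __device__ int bank_of(int wordIndex) { return wordIndex % 32; }

// bank_of(0) == 0 and bank_of(256) == 0  -> positions 0 and 256 share bank 0
// bank_of(1) == 1 and bank_of(257) == 1  -> positions 1 and 257 share bank 1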

All of these operations require each thread in the warp to obtain two distinct values from the same bank, yet I get no conflicts whatsoever:

[Image: profiler output showing no shared memory bank conflicts]

What am I missing?


Solution

  • The calculation for bank conflicts is done per memory instruction, per request. The shared load (the right-hand side) and the shared store (the left-hand side) are executed as separate instructions, many clock cycles apart (see the sketch below).
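
To make that concrete, here is a hedged sketch of how that single statement from the loop decomposes (the exact instructions the compiler emits may differ). Bank conflicts are evaluated per instruction across the 32 lanes of a warp, assuming 32 banks of 4-byte words:

// first iteration (stride = 256), warp 0 (threads 0..31);
// each shared-memory access below is a separate instruction
float rhs = scratch[threadIdx.x + 256]; // lanes read words 256..287 -> banks 0..31, no conflict
float lhs = scratch[threadIdx.x];       // lanes read words 0..31    -> banks 0..31, no conflict
scratch[threadIdx.x] = lhs + rhs;       // lanes write words 0..31   -> banks 0..31, no conflict

Within each individual instruction the 32 lanes touch 32 distinct banks, so the profiler reports zero conflicts; the fact that a single thread touches the same bank in two different instructions is irrelevant.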