Search code examples
c++cudagpuprofilingnsight

NSight Compute - expecting bank conflicts but not detecting any


I was trying to detect shared memory bank conflicts for matrix transposition kernels. The first kernel performs matrix transposition without padding, and hence should have bank conflicts, while the second kernel uses padding, and should not have bank conflicts.

However, profiling with NSight Compute in the memory workload section shows 0 bank conflicts for both kernels.

Shared memory workload statistics, baseline being the kernel without padding

I implemented the kernels as device functions like so

// tiled, with padding (expecting no bank conflicts)
template <class value_type, class container_type = value_type*>
__device__
void
transpose_padded(container_type m1, container_type m2, size_t width)
{
    __shared__ value_type tile[BLOCK_WIDTH][BLOCK_WIDTH+1];
    // BLOCK_WIDTH = 32, global scope constant
    auto row = blockDim.y*blockIdx.y + threadIdx.y;
    auto col = blockDim.x*blockIdx.x + threadIdx.x;
    auto index = row * width + col;

    auto tr_row = blockDim.y * blockIdx.x + threadIdx.y;
    auto tr_col = blockDim.x * blockIdx.y + threadIdx.x;
    auto tr_index = tr_row * width + col;

    auto local_x = threadIdx.x;
    auto local_y = threadIdx.y;
    tile[local_x][local_y] = m1[index];
    __syncthreads();
    if (tr_row < width && tr_col < width)
    {
        m2[tr_index] = tile[local_y][local_x];
    }
    
    return;
}
// tiled, without padding (expecting bank conflicts)
template <class value_type, class container_type = value_type*>
__device__
void
transpose_tiled(container_type input, container_type output, size_t width)
{
    // assuming square blocks
    extern __shared__ value_type input_tile[];
    auto row = blockDim.y*blockIdx.y + threadIdx.y;
    auto col = blockDim.x*blockIdx.x + threadIdx.x;
    auto matrix_index = row*width + col;

    auto tr_row = col;
    auto tr_col = row;
    auto tr_index = tr_row*width + tr_col;
    
    // coalesced global memory access
    auto shared_index = threadIdx.y*blockDim.x+threadIdx.x;
    input_tile[shared_index]= input[matrix_index];
    __syncthreads();
    if (tr_row < width && tr_col < width)
        output[tr_index] = input_tile[shared_index];
    return;
}

The input matrix that I used had dimensions 100x100. In both kernels, the block sizes are 32x32 threads. The instantiations have value type double.

Are there really no bank conflicts, or is this caused by something else altogether? What other information from other sections can I use to determine whether there could be bank conflicts or not?


Solution

  • For a block dimension of 32x32, I wouldn't expect either kernel to demonstrate bank conflicts. Bank conflicts are covered in many resources including many questions here on the cuda tag, so I shall summarize briefly.

    Bank conflicts arise when two or more threads in the same warp (and during the same instruction) are doing a shared load, or a shared store, where the locations referenced by those two threads are in the same bank but not the same location.

    A bank can be roughly described as a column in shared memory, when the shared memory is thought of as a 2D array with a width of 32 banks times 32-bit quantities per bank, i.e. a width of 128 bytes.

    Those definitions should provide a fairly complete understanding and cover the majority of cases of interest. We can derive one observation from those, namely that the same access pattern (adjacent threads accessing adjacent elements in memory) that works nicely for global memory coalesced loads/stores also works nicely to avoid bank conflicts. (It's not the only pattern that works nicely for shared memory, but it is one canonical pattern.)

    Turning to your codes, then:

    1. You've already indicated (correctly) that you don't expect shared bank conflicts on the first code. The shared load in that code:

       = tile[local_y][local_x];
      

      has threadIdx.x (or an index that includes threadIdx.x without any multiplicative factors on it) as the last subscript, which is a canonical pattern in CUDA for "nice" access. It indicates that adjacent threads will read from adjacent locations in memory. This works well for both global and shared memory.

      For the shared store:

      tile[local_x][local_y] = 
      

      at first glance this would appear to be "columnar" access across the warp, canonically bad for CUDA (whether global or shared) but you are using the shared memory offset-the-columns-by-1 trick:

      __shared__ value_type tile[BLOCK_WIDTH][BLOCK_WIDTH+1];
                                                         ^^
      

      so that situation is handled/sorted also. No bank conflicts expected here, for 32x32 block config (all 32 threads in each warp will have monotonically increasing threadIdx.x and constant threadIdx.y).

    2. For the second code, there is only one indexing pattern used for both shared store and shared load:

      input_tile[shared_index]=
      = input_tile[shared_index];
      

      which is:

      auto shared_index = threadIdx.y*blockDim.x+threadIdx.x;
      

      Therefore, to answer the question of bank conflicts in this case, we need only study one access pattern. Let's see if we can take the same shortcut. Does the indexing pattern include threadIdx.x with no multiplicative factors on it (in the last subscript)? Yes. Therefore, adjacent threads in the warp will access adjacent locations in memory, and this is a canonically good pattern, i.e. no bank conflicts.