Search code examples
cudagpugpu-shared-memorybank-conflict

CUDA shared memory bank conflicts report higher


I've been working on optimizing some code and ran into an issue with the shared memory bank conflict report from the CUDA Nsight performance analysis. I was able to reduce it to a very simple piece of code that Nsight reports as having a bank conflict, when it doesn't seem there should be one. Below is the kernel:

__global__ void conflict() {
    __shared__ double values[33];
    values[threadIdx.x] = threadIdx.x;
    values[threadIdx.x+1] = threadIdx.x;
}

And the main function to call it:

int main() {
    conflict<<<1,32>>>();
}

Note that I am using a single warp to really reduce this to the bare minimum. When I run the code, Nsight says there is 1 bank conflict, but according to everything I have read, there should not be any. For each access to the shared memory array, each thread is accessing consecutive values, each belonging to separate banks.

Has anyone else experienced issues with the reporting of Nsight or am I just missing something with the functioning of bank conflicts? I would appreciate any feedback!

Btw, I am running the following setup:

  • Windows 8
  • GTX 770
  • Visual Studio Community 2013
  • CUDA 7
  • Nsight Visual Studio Edition Version 4.5

Solution

  • If the intent is to run the posted code as-is, with double data type, and no bank conflicts, I believe it's possible with appropriate use of cudaDeviceSetSharedMemConfig (on cc3.x devices). Here's a test case:

    $ cat t750.cu
    #include <stdio.h>
    
    typedef double mytype;
    
    
    template <typename T>
    __global__ void conflict() {
        __shared__ T values[33];
        values[threadIdx.x] = threadIdx.x;
        values[threadIdx.x+1] = threadIdx.x;
    }
    
    int main(){
    
    #ifdef EBM
      cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    #endif
    
      conflict<mytype><<<1,32>>>();
      cudaDeviceSynchronize();
    }
    
    $ nvcc -arch=sm_35 -o t750 t750.cu
    t750.cu(8): warning: variable "values" was set but never used
              detected during instantiation of "void conflict<T>() [with T=mytype]"
    (19): here
    
    $ nvprof --metrics shared_replay_overhead ./t750
    ==46560== NVPROF is profiling process 46560, command: ./t750
    ==46560== Profiling application: ./t750
    ==46560== Profiling result:
    ==46560== Metric result:
    Invocations                               Metric Name                        Metric Description         Min         Max         Avg
    Device "Tesla K40c (0)"
     Kernel: void conflict<double>(void)
              1                    shared_replay_overhead             Shared Memory Replay Overhead    0.142857    0.142857    0.142857
    $ nvcc -arch=sm_35 -DEBM -o t750 t750.cu
    t750.cu(8): warning: variable "values" was set but never used
              detected during instantiation of "void conflict<T>() [with T=mytype]"
    (19): here
    
    $ nvprof --metrics shared_replay_overhead ./t750
    ==46609== NVPROF is profiling process 46609, command: ./t750
    ==46609== Profiling application: ./t750
    ==46609== Profiling result:
    ==46609== Metric result:
    Invocations                               Metric Name                        Metric Description         Min         Max         Avg
    Device "Tesla K40c (0)"
     Kernel: void conflict<double>(void)
              1                    shared_replay_overhead             Shared Memory Replay Overhead    0.000000    0.000000    0.000000
    $
    

    With specification of EightByteMode, the shared memory replay overhead is zero.