Search code examples
cudagpu-shared-memorybank-conflict

Expected number of bank conflicts in shared memory at random access


Let A be a properly aligned array of 32-bit integers in shared memory.

If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?

In other words:

__shared__ int A[N];          //N is some big constant integer
...
int v = A[ random(0..N-1) ];  // <-- expected number of bank conflicts here?

Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).

My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.


I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...


Solution

  • I assume fermi 32-bank shared memory where each 4 consequent bytes are stored in consequent banks. Using following code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define NBANK 32
    #define N 7823
    #define WARPSIZE 32
    
    
    #define NSAMPLE 10000
    
    int main(){
        srand ( time(NULL) );
    
        int i=0,j=0;
        int *conflictCheck=NULL;
        int *randomNumber=NULL;
        int *statisticCheck=NULL;
    
        conflictCheck=(int*)malloc(sizeof(int)*NBANK);
        randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
        statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
        while(i<NSAMPLE){
            // generate a sample warp shared memory access
            for(j=0; j<WARPSIZE; j++){
                randomNumber[j]=rand()%NBANK;
            }
            // check the bank conflict
            memset(conflictCheck, 0, sizeof(int)*NBANK);
            int max_bank_conflict=0;
            for(j=0; j<WARPSIZE; j++){
                conflictCheck[randomNumber[j]]++;
                max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
            }
            // store statistic
            statisticCheck[max_bank_conflict]++;
    
            // next iter
            i++;
        }
        // report statistic
        printf("Over %d random shared memory access, there found following precentages of bank conflicts\n");
        for(i=0; i<NBANK+1; i++){
            //
            printf("%d -> %6.4f\n",i,statisticCheck[i]/(float)NSAMPLE);
        }
        return 0;
    }
    

    I got following output:

    Over 0 random shared memory access, there found following precentages of bank conflicts
    0 -> 0.0000
    1 -> 0.0000
    2 -> 0.0281
    3 -> 0.5205
    4 -> 0.3605
    5 -> 0.0780
    6 -> 0.0106
    7 -> 0.0022
    8 -> 0.0001
    9 -> 0.0000
    10 -> 0.0000
    11 -> 0.0000
    12 -> 0.0000
    13 -> 0.0000
    14 -> 0.0000
    15 -> 0.0000
    16 -> 0.0000
    17 -> 0.0000
    18 -> 0.0000
    19 -> 0.0000
    20 -> 0.0000
    21 -> 0.0000
    22 -> 0.0000
    23 -> 0.0000
    24 -> 0.0000
    25 -> 0.0000
    26 -> 0.0000
    27 -> 0.0000
    28 -> 0.0000
    29 -> 0.0000
    30 -> 0.0000
    31 -> 0.0000
    32 -> 0.0000
    

    We can come to conclude that 3 to 4 way conflict is the most likely with random access. You can tune the run with different N (number of elements in array), NBANK (number of banks in shared memory), WARPSIZE (warp size of machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).