Expected number of bank conflicts in shared memory at random access

Let A be a properly aligned array of 32-bit integers in shared memory.

If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?

In other words:

__shared__ int A[N];          //N is some big constant integer
...
int v = A[ random(0..N-1) ];  // <-- expected number of bank conflicts here?

Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).

My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.

I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...

Solution

I assume fermi 32-bank shared memory where each 4 consequent bytes are stored in consequent banks. Using following code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NBANK 32
#define N 7823
#define WARPSIZE 32


#define NSAMPLE 10000

int main(){
    srand ( time(NULL) );

    int i=0,j=0;
    int *conflictCheck=NULL;
    int *randomNumber=NULL;
    int *statisticCheck=NULL;

    conflictCheck=(int*)malloc(sizeof(int)*NBANK);
    randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
    while(i<NSAMPLE){
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }
        // check the bank conflict
        memset(conflictCheck, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            conflictCheck[randomNumber[j]]++;
            max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
        }
        // store statistic
        statisticCheck[max_bank_conflict]++;

        // next iter
        i++;
    }
    // report statistic
    printf("Over %d random shared memory access, there found following precentages of bank conflicts\n");
    for(i=0; i<NBANK+1; i++){
        //
        printf("%d -> %6.4f\n",i,statisticCheck[i]/(float)NSAMPLE);
    }
    return 0;
}

I got following output:

Over 0 random shared memory access, there found following precentages of bank conflicts
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000

We can come to conclude that 3 to 4 way conflict is the most likely with random access. You can tune the run with different N (number of elements in array), NBANK (number of banks in shared memory), WARPSIZE (warp size of machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).