Tags: c, caching, cpu-architecture, cpu-cache

How to predict the behaviour of a direct-mapped cache with alternating loads from two 512-byte arrays


given this piece of code:

int x[2][128];
int i;
int sum=0;

for(i=0; i<128; i++){
   sum += x[0][i] * x[1][i];
}

Assuming we execute this under the following conditions:

  1. sizeof(int) = 4.
  2. Array x begins at memory address 0x0 and is stored in row-major order.
  3. In each case below, the cache is initially empty.
  4. The only memory accesses are to the entries of the array x. All other variables are stored in registers.

Given these assumptions, estimate the miss rate for the following case: the cache is 512 bytes, direct-mapped, with 16-byte cache blocks.

Given this info, I know that there are 32 sets (obtained from 512/16) in this cache. So the first load of x[0][i] brings a block of 4 ints (x[0][0] through x[0][3]) into the first set.

  1. But for the second part, x[1][i]: how do I know whether this load will evict the previously loaded x[0][i], x[0][i+1], x[0][i+2], x[0][i+3]? Or will x[1][i], x[1][i+1], x[1][i+2], x[1][i+3] be stored in a different set from the first load of x[0][i]? I am confused about how the loading into the cache works for this piece of code.

  2. What would the miss rate be in this case?

Any help is appreciated please :)


Solution

  • In general, it's impossible to predict what will happen in the cache solely by looking at the C code. To do that, you at least need to see the generated machine code.

    Remember that the compiler is allowed to do all kinds of optimization tricks as long as the final result and side effects are the same.

    So in principle a clever compiler could turn the code into:

    for(i=0; i<128; i += 4){
       regA = x[0][i];
       regB = x[0][i+1];
       regC = x[0][i+2];
       regD = x[0][i+3];
    
       sum += regA * x[1][i];
       sum += regB * x[1][i+1];
       sum += regC * x[1][i+2];
       sum += regD * x[1][i+3];
    }
    

    which would change the cache usage completely. Besides that, there may be optimization tricks at the HW level that you can't even see in the machine code.

    Anyway, if we assume a "direct", non-optimized compilation, then you will get 2 cache misses every time sum += x[0][i] * x[1][i]; executes.

    The reason is that the distance between x[0][i] and x[1][i] is 128 * 4 = 512 bytes, which is exactly the cache size. Consequently, x[0][i] and x[1][i] map to the same cache set (and in a direct-mapped cache each set holds just one line), so the block brought in by the first cache miss is evicted by the block brought in by the second cache miss.
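
    To see the conflict numerically, here is a minimal sketch that computes the set index straight from the addresses (using the question's assumptions: base address 0x0, sizeof(int) = 4, 16-byte blocks, 32 sets):

    #include <stdio.h>

    /* Direct-mapped cache from the question: 512 bytes total,
       16-byte blocks, so 512 / 16 = 32 sets (one line per set). */
    #define BLOCK_SIZE 16
    #define NUM_SETS   32

    int main(void) {
        /* x starts at address 0x0 and is row-major, so
           the address of x[r][i] is (r * 128 + i) * 4. */
        for (int i = 0; i < 8; i++) {
            unsigned addr0 = (0 * 128 + i) * 4;   /* address of x[0][i] */
            unsigned addr1 = (1 * 128 + i) * 4;   /* address of x[1][i] */
            unsigned set0 = (addr0 / BLOCK_SIZE) % NUM_SETS;
            unsigned set1 = (addr1 / BLOCK_SIZE) % NUM_SETS;
            printf("i=%d  x[0][i] -> set %u   x[1][i] -> set %u\n",
                   i, set0, set1);
        }
        return 0;
    }

    Both columns print the same set number for every i, which is exactly why each load evicts the block fetched by the previous one.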

    So there won't be any cache hits at all. You'll get 2 * 128 = 256 misses and a 100% miss rate.
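
    If you want to verify that mechanically, here is a small sketch that simulates the direct-mapped cache under the same assumptions (cache initially empty, accesses strictly alternating x[0][i], x[1][i]) and counts the misses:

    #include <stdio.h>

    #define BLOCK_SIZE 16
    #define NUM_SETS   32

    /* One tag per set; -1 means the set is empty. */
    static long tags[NUM_SETS];

    /* Simulate one access; returns 1 on a miss, 0 on a hit. */
    static int cache_access(unsigned addr) {
        unsigned block = addr / BLOCK_SIZE;
        unsigned set   = block % NUM_SETS;
        long     tag   = block / NUM_SETS;
        if (tags[set] == tag)
            return 0;            /* hit */
        tags[set] = tag;         /* miss: fill (and possibly evict) the line */
        return 1;
    }

    int main(void) {
        int misses = 0, accesses = 0;
        for (int s = 0; s < NUM_SETS; s++)
            tags[s] = -1;        /* cache starts empty */

        for (int i = 0; i < 128; i++) {
            misses += cache_access((0 * 128 + i) * 4);  /* read x[0][i] */
            misses += cache_access((1 * 128 + i) * 4);  /* read x[1][i] */
            accesses += 2;
        }
        printf("%d misses out of %d accesses (%.0f%% miss rate)\n",
               misses, accesses, 100.0 * misses / accesses);
        return 0;
    }

    It prints 256 misses out of 256 accesses, i.e. a 100% miss rate.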