Tags: caching, cpu-architecture, cpu-cache

Reduce cache misses by increasing the size of an array - why does this work?


Given this piece of code from the textbook Computer Systems: A Programmer's Perspective, 3rd ed., by Randal E. Bryant and David R. O'Hallaron (Pearson, 2016):

float dotprod(float x[8], float y[8])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}

And this is the cache information:

sizeof(float) = 4.

Array x begins at memory address 0x0 and is stored in row-major order. y follows x.

In each case below, the cache is initially empty.

The only memory accesses are to the entries of the array x and y. All other variables are stored in registers.

The cache size is 32 bytes and the cache block size is 16 bytes. The cache is direct-mapped.

We're given the explanation that the miss rate is 100%, due to constant thrashing. Then they go on to say this:

[textbook excerpt: the authors' suggested fix is to declare x as float x[12] instead of float x[8]]

I am wondering why float x[12] would be their choice instead of, for example, one of these options:

  1. x[9]
  2. x[10]
  3. x[11]
  4. x[16]

Why would none of these 4 options work (I was told they wouldn't work well, but the person didn't explain why), and why would only x[12] work as a replacement for the x[8] array? Or do they just give varying changes in the percentage of cache misses in the code?


Solution

  • In a direct-mapped cache of 32 bytes with 16-byte blocks you will have 2 blocks. This means that for a given address you will have 1 bit to identify the block index, 4 bits of block offset to identify the particular byte inside the block (2 for the word offset and 2 for the byte offset), and the rest will be the tag.

    The cache would look like this:

    Valid? | Tag | Index | Data 
    -------+-----+-------+-----
           |     |   0   | (16 bytes)
    -------+-----+-------+-----
           |     |   1   | (16 bytes)
    -------+-----+-------+-----
    

    Let's assume 8-bit addresses for simplicity (with longer addresses you only get a longer tag; nothing else changes). With 8-bit addresses you would have 3 bits for the tag, 1 for the block index and 4 for the block offset. So, given an address, e.g. 0x34, you can decompose it like so:

          block offset 
    tag     |
    001|1|0100
        |    
      block index   
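    If it helps to see the bit arithmetic, here is a minimal C sketch of that decomposition, assuming the 8-bit addresses and the 3-bit tag / 1-bit index / 4-bit offset split used above:

    #include <stdio.h>

    /* Decompose an 8-bit address into tag / block index / block offset
     * for a direct-mapped cache with 2 blocks of 16 bytes each. */
    int main(void)
    {
        unsigned addr   = 0x34;                /* the example address above   */
        unsigned offset = addr & 0xF;          /* low 4 bits: byte in block   */
        unsigned index  = (addr >> 4) & 0x1;   /* next bit: which cache block */
        unsigned tag    = addr >> 5;           /* remaining 3 bits: the tag   */

        printf("addr 0x%02x -> tag %u, index %u, offset %u\n",
               addr, tag, index, offset);
        return 0;
    }

    For 0x34 this prints tag 1, index 1, offset 4, matching the decomposition shown above.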
    

    Now assume that the two arrays are in memory one right after the other, like this:

    x[0] | x[1] | ... | x[7] | y[0] | y[1] | ... | y[7]
    

    If x starts at address 0x0, we would have the following situation:

    element  address            element  address  
    x[0]     000|0|0000 (0)     y[0]     001|0|0000 (32)
    x[1]     000|0|0100 (4)     y[1]     001|0|0100 (36)
    x[2]     000|0|1000 (8)     y[2]     001|0|1000 (40)
    x[3]     000|0|1100 (12)    y[3]     001|0|1100 (44) 
    x[4]     000|1|0000 (16)    y[4]     001|1|0000 (48) 
    x[5]     000|1|0100 (20)    y[5]     001|1|0100 (52) 
    x[6]     000|1|1000 (24)    y[6]     001|1|1000 (56) 
    x[7]     000|1|1100 (28)    y[7]     001|1|1100 (60) 
                 |                           |
           block index                 block index
    

    As you can see, the problem here is that the block index is always the same between any two elements of x and y with the same array index. This means a cache miss will happen for every single array access, since you first access x[i] and then y[i] (or the opposite), and each time you have values for the wrong array in the cache block.
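    You can check this with a small sketch that simulates the two-block direct-mapped cache over the exact access sequence of the loop (x[i], then y[i]). It assumes, as above, that sizeof(float) is 4, that x starts at address 0, and that y follows the 32 bytes of x after PAD_BYTES bytes of padding (a parameter introduced here just for illustration; 0 reproduces the back-to-back layout):

    #include <stdio.h>

    #define BLOCK_SIZE 16   /* bytes per cache block             */
    #define NUM_BLOCKS  2   /* 32-byte direct-mapped cache       */
    #define PAD_BYTES   0   /* gap between the two arrays, bytes */

    int main(void)
    {
        long tags[NUM_BLOCKS] = { -1, -1 };   /* -1 means the line is empty */
        int misses = 0;

        for (int i = 0; i < 8; i++) {
            /* The loop body reads x[i] first, then y[i]. */
            long addrs[2] = { 4L * i,                     /* address of x[i] */
                              32 + PAD_BYTES + 4L * i };  /* address of y[i] */
            for (int a = 0; a < 2; a++) {
                long block = addrs[a] / BLOCK_SIZE;
                int  index = block % NUM_BLOCKS;
                long tag   = block / NUM_BLOCKS;
                if (tags[index] != tag) {   /* wrong block (or nothing) cached */
                    misses++;
                    tags[index] = tag;
                }
            }
        }
        printf("misses: %d out of 16 accesses\n", misses);
        return 0;
    }

    With PAD_BYTES set to 0 this reports 16 misses out of 16 accesses, i.e. the 100% miss rate the textbook describes.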

    Now suppose you add the appropriate padding after the end of x:

    x[0] | x[1] | ... | x[7] | ...PADDING... | y[0] | y[1] | ... | y[7]
    

    You are now in a much better situation:

    element  address            element  address  
    x[0]     000|0|0000 (0)     y[0]     001|1|0000 (48)
    x[1]     000|0|0100 (4)     y[1]     001|1|0100 (52)
    x[2]     000|0|1000 (8)     y[2]     001|1|1000 (56)
    x[3]     000|0|1100 (12)    y[3]     001|1|1100 (60) 
    x[4]     000|1|0000 (16)    y[4]     010|0|0000 (64) 
    x[5]     000|1|0100 (20)    y[5]     010|0|0100 (68) 
    x[6]     000|1|1000 (24)    y[6]     010|0|1000 (72) 
    x[7]     000|1|1100 (28)    y[7]     010|0|1100 (76) 
                 |                           |
           block index                 block index
    

    Indeed, this situation is optimal: you will only have 4 cache misses (x[0], y[0], x[4], y[4]), one compulsory miss per block of each array.
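    (In the sketch above, changing PAD_BYTES from 0 to 16, i.e. inserting four floats of padding, makes it report exactly these 4 misses.)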


    Why wouldn't the other options work? Well, let's see.

    === WITH x[9] =======================================
    
    element  address            element  address  
    x[0]     000|0|0000 (0)     y[0]     001|0|0100 (36)
    x[1]     000|0|0100 (4)     y[1]     001|0|1000 (40)
    x[2]     000|0|1000 (8)     y[2]     001|0|1100 (44)
    x[3]     000|0|1100 (12)    y[3]     001|1|0000 (48)
    x[4]     000|1|0000 (16)    y[4]     001|1|0100 (52) 
    x[5]     000|1|0100 (20)    y[5]     001|1|1000 (56) 
    x[6]     000|1|1000 (24)    y[6]     001|1|1100 (60) 
    x[7]     000|1|1100 (28)    y[7]     010|0|0000 (64) 
    
    === WITH x[10] ======================================
    
    element  address            element  address  
    x[0]     000|0|0000 (0)     y[0]     001|0|1000 (40)
    x[1]     000|0|0100 (4)     y[1]     001|0|1100 (44)
    x[2]     000|0|1000 (8)     y[2]     001|1|0000 (48)
    x[3]     000|0|1100 (12)    y[3]     001|1|0100 (52) 
    x[4]     000|1|0000 (16)    y[4]     001|1|1000 (56) 
    x[5]     000|1|0100 (20)    y[5]     001|1|1100 (60) 
    x[6]     000|1|1000 (24)    y[6]     010|0|0000 (64) 
    x[7]     000|1|1100 (28)    y[7]     010|0|0100 (68) 
    
    === WITH x[11] =====================================
    
    element  address            element  address  
    x[0]     000|0|0000 (0)     y[0]     001|0|1100 (44)
    x[1]     000|0|0100 (4)     y[1]     001|1|0000 (48)
    x[2]     000|0|1000 (8)     y[2]     001|1|0100 (52) 
    x[3]     000|0|1100 (12)    y[3]     001|1|1000 (56) 
    x[4]     000|1|0000 (16)    y[4]     001|1|1100 (60) 
    x[5]     000|1|0100 (20)    y[5]     010|0|0000 (64) 
    x[6]     000|1|1000 (24)    y[6]     010|0|0100 (68) 
    x[7]     000|1|1100 (28)    y[7]     010|0|1000 (72) 
    
    === WITH x[16] =====================================
    
    element  address            element  address  
    x[0]     000|0|0000 (0)     y[0]     010|0|0000 (64)
    x[1]     000|0|0100 (4)     y[1]     010|0|0100 (68)
    x[2]     000|0|1000 (8)     y[2]     010|0|1000 (72) 
    x[3]     000|0|1100 (12)    y[3]     010|0|1100 (76) 
    x[4]     000|1|0000 (16)    y[4]     010|1|0000 (80) 
    x[5]     000|1|0100 (20)    y[5]     010|1|0100 (84) 
    x[6]     000|1|1000 (24)    y[6]     010|1|1000 (88) 
    x[7]     000|1|1100 (28)    y[7]     010|1|1100 (92) 
    

    As you can easily tell from the above, none of those choices of padding is optimal. x[16] is essentially identical to the worst case: with that much padding, x[i] and y[i] once again always map to the same block index. x[9] removes the conflict (i.e. puts x[i] and y[i] in different block indices) for only 1/4 of the indices, x[10] for 2/4 and x[11] for 3/4 of them.

    So, exactly as you say, those options give varying changes in the miss rate, but none of them is optimal (i.e. none achieves the lowest possible miss rate).

    The best configuration is one where the padding makes x start at block index 0 and y start at block index 1, so that x[4] then falls at block index 1 while y[4] falls at block index 0 (or exactly the opposite). Good values for the size of array x are therefore 8 * N + 4 for any N >= 1, with the smallest being 12.
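    To see the whole picture at once, here is a variant of the earlier sketch that sweeps over candidate sizes for x and counts the misses for each, again assuming 4-byte floats, x at address 0 and y placed immediately after x:

    #include <stdio.h>

    #define BLOCK_SIZE 16
    #define NUM_BLOCKS  2

    /* Misses for the access pattern x[0], y[0], ..., x[7], y[7]
     * when x is declared with `xsize` floats and y follows it directly. */
    static int count_misses(int xsize)
    {
        long tags[NUM_BLOCKS] = { -1, -1 };
        int misses = 0;
        for (int i = 0; i < 8; i++) {
            long addrs[2] = { 4L * i, 4L * xsize + 4L * i };
            for (int a = 0; a < 2; a++) {
                long block = addrs[a] / BLOCK_SIZE;
                int  index = block % NUM_BLOCKS;
                long tag   = block / NUM_BLOCKS;
                if (tags[index] != tag) { misses++; tags[index] = tag; }
            }
        }
        return misses;
    }

    int main(void)
    {
        for (int xsize = 8; xsize <= 20; xsize++)
            printf("float x[%2d]: %2d misses\n", xsize, count_misses(xsize));
        return 0;
    }

    Running it should show the minimum of 4 misses exactly at the 8 * N + 4 sizes in that range (12 and 20), with x[8] and x[16] both at the full 16 misses.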