Tags: assembly, mips, cpu-cache, micro-optimization, mars-simulator

Manipulating MIPS assembly code to decrease cache miss rate (MARS simulator)


How could assembly code be optimised to decrease the cache miss rate? I am aware that changing the placement policy, block size, or block replacement policy affects the miss rate, but I am specifically asking about the case where the cache settings in MARS are fixed and only the code may change.

We are currently asked to work with direct mapping, a block size of 4 words, and a cache size of 128 bytes, and to focus on the code instead to improve the miss rate.

My data segment is this currently:

.data                                           #data segment
    X: .word 1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1,1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1    #X[]
    H: .word 1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1                    #H[]
    Y: .word 0:8                                #Y[]

    m: .word 8                                  #m = 8
    n: .word 24                                 #n = 24

I noticed that moving the X[] array under n decreased the miss rate from 53% to 44%. Could I manipulate my code further to decrease the miss rate? Can it only be done by manipulating the data segment? Why does the hit rate improve when I move the X array a few lines down?

I would appreciate a more "why this happens and how to take advantage of it" kind of explanation, but I will include the entirety of my code for the sake of examples.

    .data                                           #data segment

    H: .word 1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1                    #H[]
    n: .word 24                                 #n = 24
    Y: .word 0:8                                #Y[]

    m: .word 8                                  #m = 8

    X: .word 1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1,1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1    #X[]

.text
fir:                                            #void fir
    la $t0, X                             #address of X[]  
    la $t1, H                              #address of H[]
    la $t2, Y                              #address of Y[]
    
    
    lw $t3, n                                   #n = 24, length of H[]
    lw $t4, m                                   #m = 8, length of Y[]

    add $t5,$zero,$zero                              #set j counter to 0
    addi $s4,$zero,4                             #word size: 4 bytes per word
                             

    j for1                                      #go to for1 loop

for1:   
    beq $t5,$t4,exit                              #when j is equal to m loop stops and goes to the exit function
    add $t6,$zero,$zero                               #y0 = $t6 = 0
    add $t7,$zero,$zero                               #set i counter to 0 for every loop

    bne $t5,$t4,for2                              #go to the nested loop for2

for2:
    beq $t7,$t3, afterfor2                        # for(i=0; i<n; i++)

    #making X[i+j]
    add $t8,$t5,$t7                               #i+j stored in $t8
    mul $t9,$t8,$s4                               #(i+j)*4 = byte offset into X[]
    addu $t9,$t9,$t0                              #add the base address of X[]
    lw $s0,($t9)                                  #load x[i+j] to $s0

    #making H[i]
    mul $s1,$s4,$t7                             #4*i
    addu $s1,$s1,$t1                            #adding 4*i to the base address of H[]
    lw $s2,($s1)                                #load h[i] to $s2

    mul $s3,$s0,$s2                            #x[i+j] * h[i]
    add $t6,$t6,$s3                              #y0 = y0 + x[i+j]*h[i]
    addi $t7,$t7,1                               # i++
    j for2                                      # repeat loop

afterfor2: 
        
    mul $s5,$t5,$s4                             #4*j stored to $s5
    addu $s5,$s5,$t2                            #add the base address of Y[]
    sw $t6,($s5)                                 #Y[j] = y0
    addi $t5,$t5,1                               #j++
    
    j for1                                      #go back to start of loop for1


exit:
    li $v0, 10
    syscall                                       #exit program

Are there any instructions (for example sw or la) that could be changed in order to improve the hit rate, or does it depend only on the data segment?


Solution

  • The hit rate depends both on where the data is placed and on the code's access pattern.

    In a direct mapped cache, there is only one place to cache each byte of memory, and that place is determined by the byte's memory address.

    Specifically, an address breaks down into fields as follows:

        +--------+-------+--------+
        |  tag   | index | offset |  field name
        +--------+-------+--------+
            25       3       4       field size
    

    There are 8 blocks/lines, so the index size is 3 bits (8 values).  There are 4 words per block/line, which is 16 bytes, so the offset is 4 bits (16 values).
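
    For instance, a short MARS fragment can pull those fields out of an address at run time.  This is just an illustrative sketch, assuming the 25/3/4 split above; it reuses the X label from the program in the question:

        la   $a0, X             # some address of interest
        andi $t0, $a0, 0xF      # offset = low 4 bits (byte within the block)
        srl  $t1, $a0, 4        # drop the offset bits
        andi $t1, $t1, 0x7      # index = next 3 bits (which block/line)
        srl  $t2, $a0, 7        # tag = remaining upper 25 bits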

    When two variables whose addresses are close together have the same index and the same tag, they share a cache block, so alternating accesses between them perform well once the block is loaded.

    When two distant variables have the same index but different tags, the direct mapped cache must evict one in order to cache the other, so alternating accesses between them keep missing.

    If the program's access pattern alternates between two distant memory variables, then it will thrash the cache.

    For example, let's focus on a single cache block, the first one at index 0:

    Let's say that we access location 0x10010000.  That will be cached in block/line 0, because the index field of that address is 000 in binary.  Next we access location 0x10010080.  This is a different address, but it also resolves to index 0.  The direct mapped cache has no choice but to evict address 0x10010000 from block 0 so that it can cache address 0x10010080 there.  If the program then accesses 0x10010000 again, that is a miss, since it is no longer cached, and servicing that miss evicts 0x10010080 from block 0 in turn.  Every time the program alternates between those two addresses, there's a miss.
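
    Here is a minimal MARS program that reproduces exactly this thrashing (the labels A and B and the loop count are made up for the demonstration; the .space 124 pad places B exactly 128 bytes after A, so both words map to index 0):

        .data
        A:  .word 1
            .space 124          # pad so that B lands 128 bytes after A
        B:  .word 2

        .text
        main:
            li   $t0, 16        # 16 round trips between the two addresses
        loop:
            lw   $t1, A         # maps to index 0, evicts B's block
            lw   $t2, B         # also index 0, evicts A's block
            addi $t0, $t0, -1
            bne  $t0, $zero, loop
            li   $v0, 10
            syscall             # exit

    Running this under MARS's Data Cache Simulator should show a miss on every load.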


    Moving one array relative to the other changes which elements of the two arrays share the same index and thus collide in the cache.
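
    One way to take advantage of that, without touching the algorithm, is to pad the data segment.  Here is a hedged sketch (the 16-byte pad is an illustrative guess, not a tuned value): a .space directive between H and X shifts every element of X to a different cache index:

        .data
        H: .word 1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1,1,5,1,4,2,1,1,1
           .space 16            # shifts X by one cache block (16 bytes)
        X: .word 1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1,1,1,1,2,2,3,4,2,1,1,2,3,2,4,8,1
        Y: .word 0:8
        m: .word 8
        n: .word 24

    Trying a few pad sizes (0, 16, 32, ... bytes) and measuring each one in the cache simulator is a reasonable way to find the best offset, since it depends on the exact access pattern.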

    Changing the size of the arrays would also have an effect: a 24-word array occupies 96 bytes, which is a good chunk of this 128-byte cache.

    Changing the algorithm would change the access pattern and thus also have an effect.  Sometimes we can change an algorithm to work on smaller parts of the array at a time, and that can result in less conflict.

    Changing the cache design (adding associativity, for example) can dramatically reduce thrashing, at the cost of more hardware.