I tried to understand what goes on during a LOAD and/or STORE instruction, so I performed 4 tests, and each time I measured the number of CPU cycles (CC), cache hits (CH), cache misses (CM), data reads (DR), and data writes (DW).
After reading the various counters, I flush the L1 (I/D) cache.
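For context, here is a minimal sketch of how the cycle counter can be read around a measured sequence. It assumes an ARMv7-A PMU, privileged (bare-metal) execution, and GCC inline assembly; measured_sequence() is a hypothetical placeholder for the instructions under test, not my actual setup.

    #include <stdint.h>

    /* Read the ARMv7-A PMU cycle counter (PMCCNTR).
       Assumes privileged execution; user-space access would
       additionally require enabling PMUSERENR. */
    static inline uint32_t read_ccnt(void)
    {
        uint32_t ccnt;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
        return ccnt;
    }

    /* Enable the PMU and the cycle counter once at startup. */
    static inline void enable_ccnt(void)
    {
        /* PMCR: set E (enable, bit 0) and C (cycle counter reset, bit 2). */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(1u | (1u << 2)));
        /* PMCNTENSET: enable the cycle counter (bit 31). */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
    }

    extern void measured_sequence(void);  /* hypothetical: the LDR/STR pair under test */

    uint32_t measure(void)
    {
        uint32_t before = read_ccnt();
        measured_sequence();
        return read_ccnt() - before;
    }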
Test1:
LDRB R3, [R4,#1]!
STR R3, [SP,#0x48+var_34]
Results: CC=4, CH=3, CM=1, DR=1, DW=2
Test2:
LDR R3, [SP,#0x48+var_34]
LDR R3, [R3]
Results: CC=4, CH=3, CM=1, DR=2, DW=1
Test3:
LDR R3, [SP,#0x48+var_38]
LDR R3, [R3]
STR R3, [SP,#0x48+var_30]
Results: CC=4, CH=4, CM=1, DR=2, DW=2
var_30 is returned at the end of the current function.
Test4:
LDR R2, [SP,#0x48+var_34]
LDR R3, [R2]
Results: CC=4, CH=3, CM=1, DR=2, DW=1
Here is my understanding:
1. Cache misses
In each test we have 1 cache miss, because when one performs
LDR reg, [something]
"something" is not yet in the freshly flushed cache, so it gets fetched and cached, and that counts as a miss.
And... that's pretty much the only "logical" interpretation I could come up with. I do not understand the different values for the cache hits, data reads, and data writes.
Any idea?
The ARM documentation at infocenter.arm.com spells out quite clearly what happens on the AXI/AMBA bus in the AMBA/AXI documentation. The processor-to-L1 interface, however, is tightly coupled, not AMBA/AXI; it is all within the core. If you are only clearing the L1, then the L2 may still contain the values, so one experiment compared to another may show different results depending on whether the L2 misses or not. Also, you are not just measuring the load and store but the fetch of the instructions too, and their alignment will change the results: even with only two instructions, if a cache-line boundary falls between them, performance may differ from when they sit in the same line. There are experiments to run just on alignment within a line, looking at when and whether another cache-line fetch goes out.
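To make the alignment point concrete, here is a tiny sketch that checks whether two consecutive 4-byte instructions fall in the same cache line. The 32-byte line size and the fetch address are assumptions for illustration; check your core's actual L1 geometry:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 32u  /* assumed L1 line size in bytes; varies by core */

    /* Which cache line does an address fall into? */
    static uint32_t line_index(uint32_t addr)
    {
        return addr / LINE_SIZE;
    }

    int main(void)
    {
        /* Two consecutive 4-byte ARM instructions. */
        uint32_t first  = 0x0000801C;   /* hypothetical fetch address */
        uint32_t second = first + 4;

        if (line_index(first) != line_index(second))
            printf("instructions straddle a line boundary: two line fetches\n");
        else
            printf("instructions share a line: one line fetch\n");
        return 0;
    }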
Also, trying to get deterministic numbers on processors like these is a bit difficult, particularly with the caches on. If you are running these experiments on anything but bare metal, then there is no reason to expect any kind of meaningful results. On bare metal the results are still suspect, but they can be made more deterministic.
If you are simply trying to understand cache basics, not specific to ARM or any other platform, then just google that, go to Wikipedia, etc. There is a TON of info out there. A cache is just faster RAM: closer to the processor in time, and built from fast (more expensive) SRAM.

Quite simply, the cache looks at your address, looks it up in a table or set of tables, and determines hit or miss. On a hit it returns the value, or accepts the write data and completes the processor side of the transaction, allowing the processor to continue while the cache finishes the write (fire and forget, basically). On a miss it has to figure out whether there is a spare opening in the cache for this data; if not, it has to evict something by writing it out. Then, or if there was already an empty spot, it can do a cache-line read, which is often larger than the read you asked for.

That read hits the L2 in the same way as the L1 (hit or miss, evict or not, and so on) until it either reaches a cache layer that gets a hit or reaches the final RAM or peripheral where the data lives. The line is then written into all the cache layers on the way back to the L1, and finally the processor gets the little bit of data it asked for. If the processor then asks for another data item in that cache line, it is already in L1 and returns really fast.

L2 is usually bigger than L1, and so on, such that everything in L1 is in L2 but not everything in L2 is in L1. That way you can evict from L1 to L2, and if something comes along later it may miss L1 but hit L2 and still be much faster than going to slow DRAM. It is a bit like keeping the tools or reference materials you use most often closest to you at your desk, and things you use less often further away, since there isn't room for everything; as you change projects, what is used most and least often evolves, and their positions on the desk change.
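To make the hit/miss/evict/line-fill sequence concrete, here is a minimal sketch of a direct-mapped cache lookup in C. The geometry (16 lines of 32 bytes), the flat memory array standing in for the next level, and the backing_read/backing_write helpers are assumptions for illustration, not any particular core's design:

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 32u    /* assumed L1 line size in bytes */
    #define NUM_LINES 16u    /* assumed number of lines (direct-mapped) */
    #define MEM_SIZE  65536u

    struct line {
        int      valid;
        int      dirty;
        uint32_t tag;
        uint8_t  data[LINE_SIZE];
    };

    static struct line cache[NUM_LINES];
    static uint8_t     memory[MEM_SIZE];  /* stands in for the next level (L2/DRAM) */

    /* Next-level accessors: always move a whole line at a time. */
    static void backing_read(uint32_t line_addr, uint8_t *buf)
    {
        memcpy(buf, &memory[line_addr], LINE_SIZE);
    }

    static void backing_write(uint32_t line_addr, const uint8_t *buf)
    {
        memcpy(&memory[line_addr], buf, LINE_SIZE);
    }

    /* Read one byte through the cache, filling a line on a miss. */
    uint8_t cache_read_byte(uint32_t addr)
    {
        uint32_t offset = addr % LINE_SIZE;
        uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;
        uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);
        struct line *l  = &cache[index];

        if (!(l->valid && l->tag == tag)) {
            /* Miss: if the slot holds dirty data, evict it by writing it out. */
            if (l->valid && l->dirty)
                backing_write((l->tag * NUM_LINES + index) * LINE_SIZE, l->data);
            /* Line fill: fetch the whole line, larger than the byte asked for. */
            backing_read(addr - offset, l->data);
            l->valid = 1;
            l->dirty = 0;
            l->tag   = tag;
        }
        /* Hit (or freshly filled line): just return the byte. */
        return l->data[offset];
    }

A real L1 is set-associative rather than direct-mapped, and writes, allocation policy, and coherency are handled in hardware, but the hit/miss/evict/line-fill decision has the same shape as above.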