algorithm, performance, caching, performance-testing

Why is the cache-misses to instructions ratio a better indicator of cache performance than the cache-misses to cache-references ratio?


I am using perf to test the code of an algorithm that is theoretically proven to be cache-friendly.

According to this article, the cache-misses to instructions ratio is a good indicator of cache performance.

The ratio of cache-misses to instructions will give an indication how well the cache is working; the lower the ratio the better. In this example the ratio is 1.26% (6,605,955 cache-misses/525,543,766 instructions). Because of the relatively large difference in cost between the RAM memory and cache access (100’s cycles vs <20 cycles) even small improvements of cache miss rate can significantly improve performance. If the cache miss rate per instruction is over 5%, further investigation is required.

However, when I run perf like this:

perf stat -B -e cache-references,cache-misses,instructions ./td 1.txt 2.txt

Perf will print the following:

Performance counter stats for './td 1.txt 2.txt':

    93,497,101      cache-references                                            
    56,452,246      cache-misses              #   60.379 % of all cache refs    
 8,115,626,200      instructions             

   2.509309040 seconds time elapsed

So perf highlights the cache-misses to cache-references ratio rather than the ratio suggested in the article.

The cache-misses to cache-references ratio seems very bad, 60%, which means roughly 60% of the cache references my application makes end up as misses. On the other hand, the cache-misses to instructions ratio is only about 0.7% (56,452,246 / 8,115,626,200).
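
As a quick sanity check, here is a minimal sketch (numbers copied verbatim from the perf output above) that recomputes both ratios:

    #include <stdio.h>

    int main(void) {
        /* Counts copied from the perf output above. */
        double cache_refs   = 93497101.0;
        double cache_misses = 56452246.0;
        double instructions = 8115626200.0;

        /* The ratio perf prints: misses as a share of all cache references. */
        printf("misses / refs:         %.3f %%\n", 100.0 * cache_misses / cache_refs);

        /* The ratio from the article: misses per instruction. */
        printf("misses / instructions: %.3f %%\n", 100.0 * cache_misses / instructions);
        return 0;
    }

This prints roughly 60.4 % and 0.7 %, matching the figures above.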

I am not sure what to make of this. Which ratio should I aim to optimize?


Solution

  • They're ultimately both misleading, but in different ways.

    Misses relative to references (or hits) is interesting to know; however, you can "soak" some misses by having lots of arithmetic to do while the miss is being handled. Misses per instruction tells you something about that, in that a very low misses/instruction number suggests you're in such a case. It doesn't mean you actually are, though: for example, if the address of the next load that misses is produced by a long calculation that itself depends on the previous miss, then everything becomes serialized and misses/instruction becomes a bit misleading (see the pointer-chasing sketch at the end of this answer). Even so, if it's low enough, the total time mostly depends on the arithmetic, so the misses wouldn't be a big problem.

    It shouldn't necessarily be optimized, since you can cheat by doing useless arithmetic work. Or, more reasonably, you can make a trade-off that spends more (useful) arithmetic to lose some misses (sketched at the end of this answer), which sounds good except that you can take it too far. Obviously, if the code starts taking more actual time (or, in general, starts performing worse in whatever you really care about), it doesn't matter that you've improved some fairly artificial metric (unless that metric really was what you cared about, which it might be in a synthetic benchmark).

    Misses/reference obviously tells you something about your access patterns, but doing lots of cache-hitting memory references isn't an indicator of good code: maybe there are just a lot of unnecessary memory references. Or, to put it another way, if data is only needed once, then touching it again (even when that generates no miss) is still a waste. It really depends on the problem. For example, if you're just summing an array, keeping the accumulator(s) in memory looks really good from the point of view of this metric, but it's obviously a really bad thing to do (see the array-sum sketch at the end of this answer).

    So,

    Which ratio should I aim to optimize?

    Neither, unless you're going for something purely synthetic. Use them to get a clue about how the code is performing, then optimize elapsed time (or power, or whatever else matters, depending on your goals). These metrics being "suspiciously good" can be as much of a clue as them being "bad", so perhaps they should simply be called "low" and "high".
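
    The pointer-chasing sketch referenced above, with a made-up node layout and sizes: in walk_list the address of the next (possibly missing) load comes from the previous load, so the misses serialize; in sum_array the loads are independent, so the core can keep several misses in flight while it keeps adding.

        #include <stdlib.h>
        #include <stdio.h>

        #define N (1 << 20)

        /* One node per cache line (64 B assumed). */
        struct node { struct node *next; long pad[7]; };

        /* Dependent chain: each miss must complete before the next address is known. */
        long walk_list(struct node *head) {
            long hops = 0;
            for (struct node *p = head; p != NULL; p = p->next)
                hops++;
            return hops;
        }

        /* Independent accesses: the hardware can keep several misses in flight
           while the additions proceed. */
        long sum_array(const long *a, size_t n) {
            long s = 0;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }

        int main(void) {
            struct node *nodes = malloc(N * sizeof *nodes);
            long *a = malloc(N * sizeof *a);
            if (!nodes || !a) return 1;

            /* Link nodes in order; a shuffled order would defeat the hardware
               prefetcher and make the serialization even more visible. */
            for (size_t i = 0; i + 1 < N; i++) nodes[i].next = &nodes[i + 1];
            nodes[N - 1].next = NULL;
            for (size_t i = 0; i < N; i++) a[i] = (long)i;

            printf("%ld %ld\n", walk_list(nodes), sum_array(a, N));
            free(nodes); free(a);
            return 0;
        }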
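
    A sketch of the arithmetic-for-misses trade-off, using an invented parity example: the table version issues one data reference per query and can miss heavily on a table much larger than the cache, while the recomputed version uses more instructions and touches no data at all. Each version makes one of the two ratios look "better"; only elapsed time tells you which is actually faster on your machine.

        #include <stdint.h>
        #include <stdio.h>

        #define TABLE_BITS 24
        #define TABLE_SIZE (1u << TABLE_BITS)

        /* 16 MiB: much bigger than most caches. */
        static uint8_t parity_table[TABLE_SIZE];

        /* One memory reference per query; few instructions, potentially many misses. */
        static int parity_lookup(uint32_t x) {
            return parity_table[x & (TABLE_SIZE - 1)];
        }

        /* No data references at all; more arithmetic, no misses. */
        static int parity_compute(uint32_t x) {
            x &= TABLE_SIZE - 1;
            x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
            return x & 1;
        }

        int main(void) {
            for (uint32_t i = 0; i < TABLE_SIZE; i++)
                parity_table[i] = (uint8_t)parity_compute(i);

            /* A simple pseudo-random query stream so the lookups jump around the table. */
            uint32_t x = 1, sum = 0;
            for (int i = 0; i < 10 * 1000 * 1000; i++) {
                x = x * 1664525u + 1013904223u;       /* LCG step */
                sum += (uint32_t)parity_lookup(x);    /* swap in parity_compute(x) to compare */
            }
            printf("%u\n", (unsigned)sum);
            return 0;
        }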
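
    And the array-sum example, as a sketch: forcing the accumulator into memory (with volatile here) adds a load and a store per element that essentially always hit in cache, so misses/reference drops even though the loop has clearly become worse.

        #include <stdio.h>
        #include <stddef.h>

        /* Accumulator in a register: one memory reference per element. */
        long sum_register(const long *a, size_t n) {
            long s = 0;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }

        /* Accumulator forced to memory: two extra references per element that
           almost always hit in cache, which "improves" misses/reference while
           making the loop strictly slower. */
        long sum_memory(const long *a, size_t n) {
            volatile long s = 0;
            for (size_t i = 0; i < n; i++)
                s = s + a[i];      /* load s, add, store s on every iteration */
            return s;
        }

        int main(void) {
            long a[1000];
            for (size_t i = 0; i < 1000; i++) a[i] = (long)i;
            printf("%ld %ld\n", sum_register(a, 1000), sum_memory(a, 1000));
            return 0;
        }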