I want to study the effects of L2 cache misses on CPU power consumption. To measure this, I have to create a benchmarks that gradually increase the working set size such that core activity (micro-operations executed per cycle) and L2 activity (L2 request per cycle) remain constant, but the ratio of L2 misses to L2 requests increases.
Can anyone show me an example of C program which forces "N" numbers of L2 cache misses?
You can generally force cache misses at some cache level by randomly accessing a working set larger than that cache level1.
You would expect the probability of any given load to be a miss to be something like: p(hit) = min(100, C / W)
, and p(miss) = 1 - p(hit)
where p(hit)
and p(miss)
are the probabilities of a hit and miss, C
is the relevant cache size, and W
is the working set size. So for a miss rate of 50%, use a working set of twice the cache size.
A quick look at the formula above shows that p(miss)
will never be 100%, since C/W
only goes to 0 as W goes to infinity (and you probably can't afford an infinite amount of RAM). So your options are:
Getting "close enough" by using a very large working set (e.g., 4 GB gives you a 99%+ miss chance for a 256 KB), and pretending you have a miss rate of 100%.
Applying the formula to determine the actual expected number of misses. E.g., if you are using a working size of 2560 KB against an L2 cache of 256 KB, you have a miss rate of 90%. So if you want to examine the effect of 1,000 misses, you should make 1000 / 0.9 = ~1111 memory access to get about 1,000 misses.
Use any approximate approach but then actually count the number of misses you incur using the performance counter units on your CPU. For example, on Linux you could use PAPI or on Linux and Windows you could use Intel's PCM (if you are using Intel hardware).
Use an "almost random" approach to force the number of misses you want. The formula above is valid for random accesses, but if you choose you access pattern so that it is random with the caveat that it doesn't repeat "recent" accesses, you can get a 100% miss ratio. Here "recent" means accesses to cache lines that are likely to still be in the cache. Calculating what that means exactly is tricky, and depends in detail on the associativity and replacement algorithm of the cache, but if you don't repeat any access that has occurred in the last cache_size * 10
accesses, you should be pretty safe.
As for the C code, you should at least show us what you've tried. A basic outline is to create a vector of bytes or ints or whatever with the required size, then to randomly access that vector. If you make each access dependent on the previous access (e.g., use the integer read to calculate the index of the next read) you will also get a rough measurement of the latency of that level of cache. If the accesses are independent, you'll probably have several outstanding misses to the cache at once, and get more misses per unit time. Which one you are interested in depend on what you are studying.
For an open source project that does this kind of memory testing across different stride and working set sizes, take a look at TinyMemBench.
1 This gets a bit trickier for levels of caches that are shared among cores (usually L3 for recent Intel chips, for example) - but it should work well if your machine is pretty quiet while testing.