Tags: algorithm, performance, optimization, memory-management, cpu-cache

What is the cache miss rate for an optimal matrix transpose?


If I have an M x N matrix and an L1 cache of size K, what cache miss rate does an optimal matrix transpose have? Obviously I am looking for something that is a function of M and N (and possibly K, though that may be too complex) rather than a specific number.

I am asking because I have a lot of matrix data that has to be processed in both directions, and I would like a rule of thumb for when it is worthwhile to keep both the original data and a transposed copy in memory.


Solution

  • You haven't said anything about the cache type you have: is it direct-mapped? N-way set-associative? Assuming an N-way set-associative cache (and yes, you do need all the details of the cache, which depend on your specific CPU architecture) and one specific matrix ordering, e.g. column-major, reading the source gives you mostly cold misses: basically M*N/C, where C is the number of elements per cache line (CPU dependent, but usually 8 doubles on a 64-byte line).

    Then you will have strided accesses on the target matrix, and unless the matrix is small enough to fit entirely in L1 you can assume a worst-case scenario of M*N misses there. For example, with an L1 of size 32 kB you can fit 4096 doubles, i.e. a square matrix of roughly 64x64.

    Therefore we would be looking at a worst case of (M*N/C + M*N) total L1 misses for the transposition. Relative to the 2*M*N element accesses the transpose performs, that is a worst-case miss rate of roughly (1 + 1/C)/2, i.e. about 56% with C = 8. (The naive loop sketched at the end of this answer has exactly this access pattern.)

    One idea would be the trick of flipping the matrix ordering, e.g. from column-major to row-major: instead of physically moving the data, access it as if it were transposed. This is a zero-cost operation if you have a matrix implementation that lets you flip the ordering over the same data (see the second sketch at the end of this answer).

    The really expensive misses, though, are never at L1 but at the LLC (last-level cache); even when you take an L1 miss it is still a cheap miss, because the line will usually be served from L2. In conclusion, it is very hard to calculate this exactly unless you have all the tiny details of your target CPU architecture.
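
    As a rough illustration of the counts above, here is a minimal sketch of a naive transpose in C, assuming double elements, row-major storage and C = 8 doubles per 64-byte line (the function name and signature are invented for illustration):

    ```c
    #include <stddef.h>

    /* Naive out-of-place transpose of an M x N row-major matrix of doubles;
     * dst must hold N x M elements.
     * Reads of src are sequential: roughly M*N/C cold misses.
     * Writes to dst are strided by M doubles: once the touched destination
     * lines no longer fit in L1 (or conflict with each other), each write
     * can miss, approaching the M*N worst case discussed above. */
    void transpose_naive(size_t M, size_t N, const double *src, double *dst)
    {
        for (size_t i = 0; i < M; ++i)
            for (size_t j = 0; j < N; ++j)
                dst[j * M + i] = src[i * N + j];
    }
    ```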
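
    And a minimal sketch of the "flip the ordering instead of moving data" idea; the mat_view type and the function names are hypothetical, not from any particular library:

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Zero-copy matrix view: the same buffer can be read either normally
     * or as its transpose just by swapping the roles of the indices. */
    typedef struct {
        double *data;       /* rows x cols elements, row-major physical layout */
        size_t  rows, cols; /* physical shape of the buffer */
        bool    transposed; /* if set, logical (i, j) maps to physical (j, i) */
    } mat_view;

    static inline double mat_get(const mat_view *v, size_t i, size_t j)
    {
        return v->transposed ? v->data[j * v->cols + i]
                             : v->data[i * v->cols + j];
    }

    /* "Transposing" is O(1): no element is touched, only the flag flips. */
    static inline mat_view mat_transposed(mat_view v)
    {
        v.transposed = !v.transposed;
        return v;
    }

    /* usage sketch:
     *   mat_view a  = { buf, M, N, false };   // M x N matrix over buf
     *   mat_view at = mat_transposed(a);      // zero-cost "transpose"
     *   double x = mat_get(&at, 2, 5);        // element (5, 2) of a
     */
    ```

    The catch is that walking the transposed view row by row is exactly the strided pattern discussed above, which is why physically keeping a second, transposed copy can still pay off when the data is traversed in both directions often.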