Tags: c++, optimization, volatile, cpu-cache

How can compiler optimizations affect a cache-unfriendly algorithm?


I've noticed an interesting behavior in the code of this question, which also comes from Agner Fog's Optimizing software in C++ and reduces to how data is accessed and stored in the cache (cache associativity). The explanation is clear to me, but then someone asked about volatile: if we add the volatile qualifier to the matrix declaration, volatile int mat[MATSIZE][MATSIZE];, the running time for the value 512 drops dramatically: 2144 → 1562 μs.

As we know, volatile prevents the compiler from caching the value in a CPU register and from optimizing away accesses to it when they seem unnecessary from the program's point of view.

One possible explanation assumes that with volatile the computation happens only in RAM and the CPU caches are not used at all. But that cannot be the whole story: the run time for the value 513 is again lower than for 512, at 1490 μs...


Solution

  • Unfortunately I cannot confirm that the volatile version runs faster. I did a test run of both the volatile and the non-volatile version, and the time comparison of the two can be seen in the chart below. When measuring performance in order to optimize code, it is important to take many samples (not just one or two, but on the order of thousands) and report the mode (https://en.wikipedia.org/wiki/Mode_(statistics)) of the gathered data, as per Alexandrescu's Fastware.

    (chart: run-time comparison of the volatile and non-volatile versions over many runs)

    There are various peaks and deep valleys, but looking at the chart you cannot conclude that the volatile version is faster.

    Indeed, the code generated by the compilers is different, but not to such an extent that it would explain a large gap; you can check it out at https://godbolt.org/g/ILw3tg