I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.
// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
double sum = 0.0;
for (uint64_t i = 0; i < ni; ++i) {
sum += data[i][0];
}
return sum;
}
int compare(void const* _a, void const* _b)
{
double const a = *((double*)_a);
double const b = *((double*)_b);
return (a > b) - (a < b);
}
int main(int argc, char** argv)
{
// set sizes
assert(argc == 2);
uint64_t const iter_max = 101;
uint64_t const ni = 1000000;
uint64_t const nj = strtol(argv[1], 0, 10);
// initialize data
double(*data)[nj] = calloc(ni, sizeof(*data));
for (uint64_t i = 0; i < ni; ++i) {
for (uint64_t j = 0; j < nj; ++j) {
data[i][j] = rand() / (double)RAND_MAX;
}
}
// test performance
double* dt = calloc(iter_max, sizeof(*dt));
double const sum0 = sum_column(ni, nj, data);
for (uint64_t iter = 0; iter < iter_max; ++iter) {
clock_t const t_start = clock();
double const sum = sum_column(ni, nj, data);
clock_t const t_stop = clock();
assert(sum == sum0);
dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
}
// sort dt
qsort(dt, iter_max, sizeof(*dt), compare);
// compute mean dt
double dt_mean = 0.0;
for (uint64_t iter = 0; iter < iter_max; ++iter) {
dt_mean += dt[iter];
}
dt_mean /= iter_max;
// print results
printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
dt[iter_max - 1]);
// free memory
free(data);
}
However, the results are not quite what I expected:
As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 bytes on my machine). This would explain why, with growing nj, the time to solution first increases linearly and then levels out at some value. If nj == 1, one load places the next 7 values in the cache, so we only need to load from RAM for every 8th value. If nj == 2, following the same logic, we need to access RAM for every 4th value. After some size, we have to access RAM for every value, which should make the performance level out. My guess for why the linear section of the graph extends past 4 is that in reality there are multiple levels of cache at work here, and the way values end up in these caches is a little more complex than what I described.
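For reference, here is the rough back-of-the-envelope model I have in mind, as a small standalone C snippet (my own sketch, assuming 64-byte cache lines and an aligned allocation):

#include <stdio.h>

int main(void)
{
    // Rough model: with a stride of nj doubles (8 bytes each), how many
    // consecutive accesses of data[i][0] fall into the same 64-byte line?
    for (int nj = 1; nj <= 16; ++nj) {
        int stride = 8 * nj;                          // bytes between accessed values
        int per_line = stride < 64 ? 64 / stride : 1; // accesses served per line load
        printf("nj=%2d -> ~1 RAM load every %d accesses\n", nj, per_line);
    }
    return 0;
}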
What I cannot explain is why there are these performance peaks at multiples of 16.
After thinking about this question for a bit, I decided to check whether this also occurs for higher values of nj:
In fact, it does. And there is more: why does the performance increase again after ~250?
Could someone explain to me, or point me to an appropriate reference, why there are these peaks and why the performance increases for higher values of nj?
If you would like to try the code for yourself, here is my plotting script as well, for convenience:
import numpy as np
import matplotlib.pyplot as plt
data = np.genfromtxt("out.csv")
data[:,1:] /= data[0,1]
dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
plt.axvline(data[i + 1,0], color='gray', linestyle='--')
plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
rotation=0, ha="center", va="center",
bbox=dict(boxstyle="round", ec='gray', fc='w'))
plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")
The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with 64-byte cache lines (typically an x86 one).
I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:
First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the accesses are contiguous. In this case, the processor can efficiently prefetch the cache lines from DRAM so as to hide its high latency. There is also good spatial locality since many contiguous items share the same cache line. When nj = 2, only half of each cache line is used. This means the number of requested cache lines is twice as big for the same number of operations. That being said, the time is not much bigger because of the relatively high latency of adding two floating-point numbers, which makes the code compute-bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel (as done in the modified code at the end of this answer). Note that most processors can also load multiple values from the cache per cycle. When nj = 4, a new cache line is requested every 2 iterations (since a double takes 8 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One may expect the time to be stable for nj >= 8 since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the huge overhead of the DRAM latency. The number of prefetched cache lines is generally between 2 and 4 (AFAIK this prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, i.e. when nj >= 64). This explains why the timings increase sharply for nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way set-associative, with N typically between 4 and 16. For example, here are the cache characteristics of my i5-9600KF processor:
Cache 0: L1 data cache, line size 64, 8-ways, 64 sets, size 32k
Cache 1: L1 instruction cache, line size 64, 8-ways, 64 sets, size 32k
Cache 2: L2 unified cache, line size 64, 4-ways, 1024 sets, size 256k
Cache 3: L3 unified cache, line size 64, 12-ways, 12288 sets, size 9216k
This means that two values fetched from DRAM at respective addresses A1 and A2 can result in conflicts in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In this case, the processor needs to choose which cache line to replace among a set of N = 8 lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations compete for the same cache lines, resulting in excessive cache misses. More information about this can also be found in this post.
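As an illustration (a minimal sketch, assuming the 64-set, 8-way, 64-byte-line L1 cache listed above, and an allocation starting at set 0), one can compute which L1 set an address maps to and see that large power-of-two strides keep hitting the same few sets:

#include <stdint.h>
#include <stdio.h>

// Map an address offset to an L1 set: 64-byte lines, 64 sets
// (32 KiB / 8 ways / 64 bytes), as on the i5-9600KF above.
static uint64_t l1_set(uint64_t addr) { return (addr / 64) % 64; }

int main(void)
{
    // With nj = 512 (stride of 512 * 8 = 4096 bytes), every accessed
    // element maps to the same L1 set: only the 8 ways of that set are usable.
    for (uint64_t i = 0; i < 4; ++i)
        printf("nj=512, i=%lu -> set %lu\n", i, l1_set(i * 512 * 8));
    // With nj = 16 (stride of 128 bytes), only every other set is used.
    for (uint64_t i = 0; i < 4; ++i)
        printf("nj=16,  i=%lu -> set %lu\n", i, l1_set(i * 16 * 8));
    return 0;
}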
Regarding the nj stride, the number of cache lines that can effectively be used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (i.e. 8 on my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem, since the prefetcher needs a fairly large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching one cache line from DRAM is several dozen nanoseconds (e.g. ~70 ns) while the bandwidth is several dozen GiB/s (e.g. ~40 GiB/s): by Little's law, roughly 70 ns * 40 GiB/s ≈ 3 KiB must be in flight, i.e. dozens of cache lines (e.g. ~40) should be fetched concurrently so as to hide the latency and saturate the DRAM.
Here is a simulation of the number of cache lines that can actually be used in my L1 cache, depending on the value of nj:
nj #cache-lines
1 512
2 512
3 512
4 512
5 512
6 512
7 512
8 512
9 512
10 512
11 512
12 512
13 512
14 512
15 512
16 256 <----
17 512
18 512
19 512
20 512
21 512
22 512
23 512
24 512
25 512
26 512
27 512
28 512
29 512
30 512
31 512
32 128 <----
33 512
34 512
35 512
36 512
37 512
38 512
39 512
40 512
41 512
42 512
43 512
44 512
45 512
46 512
47 512
48 256 <----
49 512
50 512
51 512
52 512
53 512
54 512
55 512
56 512
57 512
58 512
59 512
60 512
61 512
62 512
63 512
64 64 <----
==============
80 256
96 128
112 256
128 32
144 256
160 128
176 256
192 64
208 256
224 128
240 256
256 16
384 32
512 8
1024 4
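A minimal sketch of one way to estimate these numbers (my approximation, not necessarily the exact simulation used for the table: count the distinct L1 sets touched by the accessed addresses and multiply by the 8-way associativity; it reproduces most of the table, though perhaps not every entry for very large nj):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Estimate how many L1 cache lines the column walk can actually use,
// assuming 64-byte lines, 64 sets and 8 ways (32 KiB L1), and an
// allocation starting at set 0. Accessed addresses are i * nj * 8.
static uint64_t usable_lines(uint64_t nj)
{
    enum { SETS = 64, WAYS = 8, LINE = 64 };
    uint8_t touched[SETS];
    memset(touched, 0, sizeof(touched));
    for (uint64_t i = 0; i < 1000000; ++i) {
        uint64_t set = (i * nj * 8 / LINE) % SETS;
        touched[set] = 1;
    }
    uint64_t sets_used = 0;
    for (uint64_t s = 0; s < SETS; ++s)
        sets_used += touched[s];
    return sets_used * WAYS; // each touched set provides at most 8 ways
}

int main(void)
{
    for (uint64_t nj = 1; nj <= 64; ++nj)
        printf("%3lu %4lu\n", nj, usable_lines(nj));
    return 0;
}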
We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher preloads data into cache lines that are likely to be evicted early by subsequent fetches (done concurrently). Load instructions performed in the code are then more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value needs to be fetched again from the L2 or even the L3, resulting in slower execution. Note that the L2 cache is subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute accesses and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
Here are the timings on my machine for some of the peaks (columns as in the program output: nj, median, mean, min, max):
32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03
As expected, the timings are overall bigger in practice when the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising since they are significantly faster than the others. This is the case where the number of available cache lines equals the associativity N. My guess is that Intel processors detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).
Finally, for large nj strides, a bigger nj should result in higher overheads, mainly because of the translation lookaside buffer (TLB): there are more page addresses to translate with a bigger nj, and the number of TLB entries is limited. In fact, this is what I can observe on my machine: timings tend to increase slowly in a very stable way, unlike on your target platform.
I cannot really explain this very strange behavior yet. Here are some wild guesses:
- (Transparent) huge pages may be used when nj is large (so as to reduce the overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher, as AFAIK it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit memmap), or possibly disable them; see the sketch just below. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
- NUMA effects may also play a role; you can use numactl on Linux so as to force local allocations.
For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 caches work in practice is available here.
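As a sketch of the huge-page idea mentioned above (a hypothetical helper of my own, assuming Linux with transparent huge pages available; alloc_huge is not part of the original code):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

// Hypothetical helper: allocate 'size' bytes aligned to 2 MiB and hint the
// kernel to back the region with transparent huge pages. Unlike calloc,
// the memory is not zero-initialized.
static void* alloc_huge(size_t size)
{
    void* ptr = NULL;
    if (posix_memalign(&ptr, 2 * 1024 * 1024, size) != 0)
        return NULL;
#ifdef MADV_HUGEPAGE
    madvise(ptr, size, MADV_HUGEPAGE); // best effort: ignore errors
#endif
    return ptr;
}

One would then allocate data with something like alloc_huge(ni * sizeof(*data)) instead of calloc (and zero it explicitly if that matters), then compare the AnonHugePages counter and the timings with and without the hint.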
To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetch instructions, so that cache lines are fetched into a non-temporal cache structure close to the processor that should not cause thrashing in the L1 (if possible). Such a structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop-unrolling optimization to speed up the code when nj is small (note that the function now takes a flat double* pointer, so the call site needs a small adjustment, e.g. passing &data[0][0]):
#include <xmmintrin.h> // for _mm_prefetch / _MM_HINT_NTA
double sum_column(uint64_t ni, uint64_t nj, double* const data)
{
double sum0 = 0.0;
double sum1 = 0.0;
double sum2 = 0.0;
double sum3 = 0.0;
if(nj % 16 == 0)
{
// Cache-bypassing (non-temporal) prefetch to avoid cache thrashing
const size_t distance = 12; // prefetch distance, in iterations (see the note below)
for (uint64_t i = 0; i < ni; ++i) {
_mm_prefetch(&data[(i+distance)*nj+0], _MM_HINT_NTA);
sum0 += data[i*nj+0];
}
}
else
{
// Unrolling is much better for small strides
// (assumes ni is a multiple of 4, which holds here since ni = 1000000)
for (uint64_t i = 0; i < ni; i+=4) {
sum0 += data[(i+0)*nj+0];
sum1 += data[(i+1)*nj+0];
sum2 += data[(i+2)*nj+0];
sum3 += data[(i+3)*nj+0];
}
}
return sum0 + sum1 + sum2 + sum3;
}
Here is the result of the modified code:
We can see that the peaks no longer appear in the timings. We can also see that the (normalized) values are much bigger, because dt0 is now about 4 times smaller (thanks to the loop unrolling).
Note that cache thrashing in the L2 cache is not avoided with this method in practice (at least on Intel processors). This means that the effect is still there for huge nj strides that are a multiple of 512 (4 KiB) on my machine (it is actually a bit slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj % 512) == 0 && nj >= 512 on x86 processors. AFAIK, there is no way to fully address this problem. That being said, it is a very bad idea to perform such big strided accesses on very large data structures anyway.
Note that distance should be chosen carefully, since prefetching too early can result in cache lines being evicted before they are actually used (so they need to be fetched again), and prefetching too late is not very useful. I think using a value close to the number of entries in the LFB/MAB is a good idea (e.g. 12 on Skylake/Kaby Lake/Cannon Lake, 22 on Zen 2).