Tags: c, performance, cpu-cache

How does the CPU cache affect the performance of a C program


I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns.

// compiled with: gcc -Wall -Wextra -Ofast -march=native cache.c
// tested with: for n in {1..100}; do ./a.out $n; done | tee out.csv
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
{
    double sum = 0.0;
    for (uint64_t i = 0; i < ni; ++i) {
        sum += data[i][0];
    }
    return sum;
}

int compare(void const* _a, void const* _b)
{
    double const a = *((double*)_a);
    double const b = *((double*)_b);
    return (a > b) - (a < b);
}

int main(int argc, char** argv)
{
    // set sizes
    assert(argc == 2);
    uint64_t const iter_max = 101;
    uint64_t const ni       = 1000000;
    uint64_t const nj       = strtol(argv[1], 0, 10);

    // initialize data
    double(*data)[nj] = calloc(ni, sizeof(*data));
    for (uint64_t i = 0; i < ni; ++i) {
        for (uint64_t j = 0; j < nj; ++j) {
            data[i][j] = rand() / (double)RAND_MAX;
        }
    }

    // test performance
    double* dt        = calloc(iter_max, sizeof(*dt));
    double const sum0 = sum_column(ni, nj, data);
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        clock_t const t_start = clock();
        double const sum      = sum_column(ni, nj, data);
        clock_t const t_stop  = clock();
        assert(sum == sum0);
        dt[iter] = (t_stop - t_start) / (double)CLOCKS_PER_SEC;
    }

    // sort dt
    qsort(dt, iter_max, sizeof(*dt), compare);

    // compute mean dt
    double dt_mean = 0.0;
    for (uint64_t iter = 0; iter < iter_max; ++iter) {
        dt_mean += dt[iter];
    }
    dt_mean /= iter_max;

    // print results
    printf("%2lu %.8e %.8e %.8e %.8e\n", nj, dt[iter_max / 2], dt_mean, dt[0],
        dt[iter_max - 1]);

    // free memory
    free(data);
    free(dt);
}

However, the results are not quite what I would expect them to be: (plot for nj = 1..100)

As far as I understand, when the CPU loads a value from data, it also places some of the following values of data in the cache. The exact number depends on the cache line size (64 bytes on my machine). This would explain why, with growing nj, the time to solution first increases linearly and then levels out at some value. If nj == 1, one load places the next 7 values in the cache, so we only need to load from RAM for every 8th value. If nj == 2, following the same logic, we need to access RAM for every 4th value. After some size, we will have to access RAM for every value, which should cause the performance to level out. My guess for why the linear section of the graph extends beyond 4 is that in reality there are multiple levels of cache at work here, and the way values end up in these caches is a little more complex than what I explained here.

What I cannot explain is why there are these performance peaks at multiples of 16.

After thinking about this question for a bit, I decided to check if this also occurs for higher values of nj: (plot for nj = 1..500)

In fact, it does. And there is more: why does the performance increase again after ~250?

Could someone explain to me, or point me to some appropriate reference, why there are these peaks and why the performance increases for higher values of nj?

If you would like to try the code for yourself, I am also attaching my plotting script for your convenience:

import numpy as np
import matplotlib.pyplot as plt

# load the timings and normalize them by the nj = 1 median (dt_0)
data = np.genfromtxt("out.csv")
data[:,1:] /= data[0,1]

# slope of the mean curve (used below to mark where it drops sharply)
dy = np.diff(data[:,2]) / np.diff(data[:,0])
for i in range(len(dy) - 1):
    if dy[i] - dy[i + 1] > (dy.max() - dy.min()) / 2:
        plt.axvline(data[i + 1,0], color='gray', linestyle='--')
        plt.text(data[i + 1,0], 1.5 * data[0,3], f"{int(data[i + 1,0])}",
                 rotation=0, ha="center", va="center",
                 bbox=dict(boxstyle="round", ec='gray', fc='w'))

# shade the min-max band and plot the median and mean curves
plt.fill_between(data[:,0], data[:,3], data[:,4], color='gray')
plt.plot(data[:,0], data[:,1], label="median")
plt.plot(data[:,0], data[:,2], label="mean")
plt.legend(loc="upper left")
plt.xlabel("nj")
plt.ylabel("dt / dt$_0$")
plt.savefig("out.pdf")

Solution

  • The plots show the combination of several complex low-level effects (mainly cache thrashing and prefetching issues). I assume the target platform is a mainstream modern processor with 64-byte cache lines (typically an x86 one).

    I can reproduce the problem on my i5-9600KF processor. Here is the resulting plot:

    performance plot


    First of all, when nj is small, the gap between fetched addresses (i.e. the stride) is small and cache lines are used relatively efficiently. For example, when nj = 1, the accesses are contiguous. In this case, the processor can efficiently prefetch the cache lines from DRAM so as to hide their high latency. There is also good spatial locality since many contiguous items share the same cache line. When nj = 2, only half of each cache line is used. This means the number of requested cache lines is twice as big for the same number of operations. That being said, the time is not much bigger because of the relatively high latency of adding two floating-point numbers, which makes the code compute-bound. You can unroll the loop 4 times and use 4 different sum variables so that (mainstream modern) processors can add multiple values in parallel (see the sketch below). Note that most processors can also load multiple values from the cache per cycle.

    When nj = 4, a new cache line is requested every 2 cycles (since a double takes 8 bytes). As a result, the memory throughput can become so big that the computation becomes memory-bound. One may expect the time to be stable for nj >= 8 since the number of requested cache lines should be the same, but in practice processors prefetch multiple contiguous cache lines so as not to pay the huge DRAM latency in this case. The number of prefetched cache lines is generally between 2 and 4 (AFAIK such a prefetching strategy is disabled on Intel processors when the stride is bigger than 512 bytes, so when nj >= 64). This explains why the timings increase sharply while nj < 32 and become relatively stable for 32 <= nj <= 256, with the exception of the peaks.
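
    As a concrete illustration of the unrolling idea mentioned above, here is a minimal sketch (a drop-in replacement for sum_column in the question's program; it assumes ni is a multiple of 4, which holds for ni = 1000000):

    // Sketch: 4 independent accumulators let the processor overlap the latency
    // of the floating-point additions (assumes ni % 4 == 0).
    double sum_column(uint64_t ni, uint64_t nj, double const data[ni][nj])
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (uint64_t i = 0; i < ni; i += 4) {
            s0 += data[i + 0][0];
            s1 += data[i + 1][0];
            s2 += data[i + 2][0];
            s3 += data[i + 3][0];
        }
        return (s0 + s1) + (s2 + s3);
    }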

    The regular peaks happening when nj is a multiple of 16 are due to a complex cache effect called cache thrashing. Modern caches are N-way set-associative with N typically between 4 and 16. For example, here are the statistics for my i5-9600KF processor:

    Cache 0: L1 data cache,        line size 64,  8-ways,    64 sets, size 32k 
    Cache 1: L1 instruction cache, line size 64,  8-ways,    64 sets, size 32k 
    Cache 2: L2 unified cache,     line size 64,  4-ways,  1024 sets, size 256k 
    Cache 3: L3 unified cache,     line size 64, 12-ways, 12288 sets, size 9216k 
    

    This means that two values fetched from DRAM, with respective addresses A1 and A2, can result in conflicts in my L1 cache if (A1 % 32768) / 64 == (A2 % 32768) / 64. In this case, the processor needs to choose which cache line to replace within a set of N = 8 cache lines. There are many cache replacement policies and none is perfect. Thus, some useful cache lines are sometimes evicted too early, resulting in additional cache misses later. In pathological cases, many DRAM locations can compete for the same cache lines, resulting in excessive cache misses. More information about this can also be found in this post.
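
    As a small illustration, here is a sketch of the criterion above, hard-coded for the 32 KiB, 64-byte-line, 8-way L1 listed earlier (the helper name l1_conflict is made up for this example):

    #include <stdint.h>
    #include <stdio.h>

    // Illustrative sketch: two addresses compete for the same group of N = 8
    // ways under the criterion above (32 KiB L1 cache, 64-byte lines).
    static int l1_conflict(uintptr_t a1, uintptr_t a2)
    {
        return (a1 % 32768) / 64 == (a2 % 32768) / 64;
    }

    int main(void)
    {
        // Adjacent cache lines do not compete under this criterion...
        printf("%d\n", l1_conflict(0, 64));           // prints 0
        // ...but addresses exactly one L1 size apart do. With nj = 512
        // (a 4 KiB row stride), rows that are 8 rows apart are 32 KiB apart:
        printf("%d\n", l1_conflict(0, 8 * 512 * 8));  // prints 1
        return 0;
    }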

    Depending on the nj stride, the number of cache lines that can be effectively used in the L1 cache is limited. For example, if all fetched values have the same address modulo the cache size, then only N cache lines (ie. 8 for my processor) can actually be used to store all the values. Having fewer cache lines available is a big problem since the prefetcher needs a pretty large space in the cache to store the many cache lines needed later. The smaller the number of concurrent fetches, the lower the memory throughput. This is especially true here since the latency of fetching one cache line from DRAM is several dozens of nanoseconds (eg. ~70 ns) while the bandwidth is dozens of GiB/s (eg. ~40 GiB/s): dozens of cache lines (~70 ns × ~40 GiB/s ≈ 3 KiB, i.e. roughly 40-50 lines of 64 bytes) should be fetched concurrently so as to hide the latency and saturate the DRAM.

    Here is a simulation of the number of cache lines that can actually be used in my L1 cache, as a function of nj:

     nj  #cache-lines
      1      512
      2      512
      3      512
      4      512
      5      512
      6      512
      7      512
      8      512
      9      512
     10      512
     11      512
     12      512
     13      512
     14      512
     15      512
     16      256    <----
     17      512
     18      512
     19      512
     20      512
     21      512
     22      512
     23      512
     24      512
     25      512
     26      512
     27      512
     28      512
     29      512
     30      512
     31      512
     32      128    <----
     33      512
     34      512
     35      512
     36      512
     37      512
     38      512
     39      512
     40      512
     41      512
     42      512
     43      512
     44      512
     45      512
     46      512
     47      512
     48      256    <----
     49      512
     50      512
     51      512
     52      512
     53      512
     54      512
     55      512
     56      512
     57      512
     58      512
     59      512
     60      512
     61      512
     62      512
     63      512
     64       64    <----
    ==============
     80      256
     96      128
    112      256
    128       32
    144      256
    160      128
    176      256
    192       64
    208      256
    224      128
    240      256
    256       16
    384       32
    512        8
    1024       4
    

    We can see that the number of available cache lines is smaller when nj is a multiple of 16. In this case, the prefetcher preloads data into cache lines that are likely to be evicted early by subsequent (concurrent) fetches. The load instructions performed in the code are then more likely to result in cache misses when the number of available cache lines is small. When a cache miss happens, the value needs to be fetched again from the L2 or even the L3, resulting in a slower execution. Note that the L2 cache is subject to the same effect, though it is less visible since it is larger. The L3 cache of modern x86 processors makes use of hashing to better distribute things and reduce collisions from fixed strides (at least on Intel processors, and certainly on AMD too, though AFAIK this is not documented).
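
    In case it helps, here is a minimal sketch of one way such a simulation can be written, assuming the conflict criterion described above (addresses compared modulo the 32 KiB L1 size, with 64-byte lines); it reproduces the numbers in the table for the L1 parameters listed earlier:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    // Count how many distinct 64-byte line slots of a 32 KiB (L1-sized) window
    // the column accesses can touch for a row stride of nj doubles.
    static int usable_cache_lines(uint64_t nj)
    {
        enum { CACHE_SIZE = 32768, LINE_SIZE = 64, SLOTS = CACHE_SIZE / LINE_SIZE };
        bool used[SLOTS] = {false};
        int count = 0;
        // The period of (i * stride) % CACHE_SIZE divides CACHE_SIZE, so
        // CACHE_SIZE iterations are enough to visit every reachable slot.
        for (uint64_t i = 0; i < CACHE_SIZE; ++i) {
            uint64_t const slot = (i * nj * sizeof(double)) % CACHE_SIZE / LINE_SIZE;
            if (!used[slot]) {
                used[slot] = true;
                ++count;
            }
        }
        return count;
    }

    int main(void)
    {
        for (uint64_t nj = 1; nj <= 1024; nj *= 2)
            printf("%4lu %4d\n", (unsigned long)nj, usable_cache_lines(nj));
        return 0;
    }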

    Here are the timings on my machine for some of the peaks:

      32 4.63600000e-03 4.62298020e-03 4.06400000e-03 4.97300000e-03
      48 4.95800000e-03 4.96994059e-03 4.60400000e-03 5.59800000e-03
      64 5.01600000e-03 5.00479208e-03 4.26900000e-03 5.33100000e-03
      96 4.99300000e-03 5.02284158e-03 4.94700000e-03 5.29700000e-03
     128 5.23300000e-03 5.26405941e-03 4.93200000e-03 5.85100000e-03
     192 4.76900000e-03 4.78833663e-03 4.60100000e-03 5.01600000e-03
     256 5.78500000e-03 5.81666337e-03 5.77600000e-03 6.35300000e-03
     384 5.25900000e-03 5.32504950e-03 5.22800000e-03 6.75800000e-03
     512 5.02700000e-03 5.05165347e-03 5.02100000e-03 5.34400000e-03
    1024 5.29200000e-03 5.33059406e-03 5.28700000e-03 5.65700000e-03
    

    As expected, the timings are overall bigger in practice in the cases where the number of available cache lines is much smaller. However, when nj >= 512, the results are surprising since they are significantly faster than the others. This is the case where the number of available cache lines is equal to the associativity N. My guess is that Intel processors detect this pathological case and optimize the prefetching so as to reduce the number of cache misses (using line-fill buffers to bypass the L1 cache -- see below).

    Finally, for large nj strides, a bigger nj should result in higher overheads, mainly because of the translation lookaside buffer (TLB): there are more page addresses to translate with a bigger nj and the number of TLB entries is limited. In fact, this is what I can observe on my machine: timings tend to slowly increase in a very stable way, unlike on your target platform.

    I cannot really explain this very strange behavior yet. Here are some wild guesses:

    • The OS could tend to use more huge pages when nj is large (so as to reduce the overhead of the TLB) since wider blocks are allocated. This could result in more concurrency for the prefetcher as, AFAIK, it cannot cross page boundaries. You can try to check the number of allocated (transparent) huge pages (by looking at AnonHugePages in /proc/meminfo on Linux), force them to be used in this case (using an explicit memmap), or possibly disable them. My system appears to make use of 2 MiB transparent huge pages independently of the nj value.
    • If the target architecture is a NUMA one (eg. newer AMD processors or a server with multiple processors having their own memory), then the OS could allocate pages physically stored on another NUMA node because there is less space available on the current NUMA node. This could result in higher performance thanks to the bigger throughput (though the latency is higher). You can control this policy with numactl on Linux so as to force local allocations.

    For more information about this topic, please read the great document What Every Programmer Should Know About Memory. Moreover, a very good post about how x86 caches work in practice is available here.


    Removing the peaks

    To remove the peaks due to cache thrashing on x86 processors, you can use non-temporal software prefetching instructions so that cache lines are fetched into a non-temporal cache structure close to the processor, which should not cause thrashing in the L1 (if possible). Such a cache structure is typically the line-fill buffers (LFB) on Intel processors and the (equivalent) miss address buffers (MAB) on AMD Zen processors. For more information about non-temporal instructions and the LFB, please read this post and this one. Here is the modified code, which also includes a loop unrolling optimization to speed up the code when nj is small:

    // Note: this version takes the matrix as a flat pointer (data[i*nj + j])
    // and needs #include <xmmintrin.h> for _mm_prefetch / _MM_HINT_NTA.
    double sum_column(uint64_t ni, uint64_t nj, double* const data)
    {
        double sum0 = 0.0;
        double sum1 = 0.0;
        double sum2 = 0.0;
        double sum3 = 0.0;

        if(nj % 16 == 0)
        {
            // Cache-bypassing prefetch to avoid cache thrashing
            const size_t distance = 12;
            for (uint64_t i = 0; i < ni; ++i) {
                _mm_prefetch(&data[(i+distance)*nj+0], _MM_HINT_NTA);
                sum0 += data[i*nj+0];
            }
        }
        else
        {
            // Unrolling is much better for small strides (assumes ni % 4 == 0)
            for (uint64_t i = 0; i < ni; i+=4) {
                sum0 += data[(i+0)*nj+0];
                sum1 += data[(i+1)*nj+0];
                sum2 += data[(i+2)*nj+0];
                sum3 += data[(i+3)*nj+0];
            }
        }

        return sum0 + sum1 + sum2 + sum3;
    }
    

    Here is the result of the modified code:

    performance plot of the optimized code

    We can see that the peaks no longer appear in the timings. We can also see that the relative values are much bigger because dt0 is about 4 times smaller (due to the loop unrolling).

    Note that cache thrashing in the L2 cache is not avoided with this method in practice (at least on Intel processors). This means that the effect is still there with huge nj strides that are multiples of 512 (4 KiB) on my machine (it is actually a bit slower than before, especially when nj >= 2048). It may be a good idea to stop the prefetching when (nj%512) == 0 && nj >= 512 on x86 processors, as sketched below; apart from that, AFAIK there is no way to address this problem. That being said, performing such big strided accesses on very large data structures is a very bad idea anyway.
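
    For instance, the dispatch in the modified sum_column above could be adjusted along those lines (just a sketch under the same assumptions as before; sum_column_v2 is only a name for this variant):

    #include <stdbool.h>
    #include <stdint.h>
    #include <xmmintrin.h>  // _mm_prefetch / _MM_HINT_NTA (x86 only)

    // Sketch: same as above, but the non-temporal prefetch is skipped for the
    // huge strides that are multiples of 4 KiB (nj % 512 == 0 and nj >= 512),
    // where L2 thrashing is not avoided anyway. Assumes ni % 4 == 0.
    double sum_column_v2(uint64_t ni, uint64_t nj, double* const data)
    {
        double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
        bool const huge_stride = (nj % 512) == 0 && nj >= 512;

        if (nj % 16 == 0 && !huge_stride) {
            // Cache-bypassing prefetch for the problematic moderate strides
            const uint64_t distance = 12;
            for (uint64_t i = 0; i < ni; ++i) {
                _mm_prefetch((char const*)&data[(i + distance) * nj], _MM_HINT_NTA);
                sum0 += data[i * nj];
            }
        } else {
            // Unrolled loop for the remaining strides
            for (uint64_t i = 0; i < ni; i += 4) {
                sum0 += data[(i + 0) * nj];
                sum1 += data[(i + 1) * nj];
                sum2 += data[(i + 2) * nj];
                sum3 += data[(i + 3) * nj];
            }
        }
        return sum0 + sum1 + sum2 + sum3;
    }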

    Note that distance should be carefully chosen: prefetching too early can result in cache lines being evicted before they are actually used (so they need to be fetched again), and prefetching too late is not very useful. I think a value close to the number of entries in the LFB/MAB is a good choice (eg. 12 on Skylake/KabyLake/CannonLake, 22 on Zen-2).