Tags: caching, compiler-optimization, cpu-cache, micro-optimization, prefetch

Is there a way to "unfetch" a cache line?


Let's say I'm looping through 10 different 4 KB arrays of ints, incrementing them:

int* buffers[10] = ...; // ten 4 KB buffers, not next to each other
for (int i = 0; i < 10; i++) {
  for (int j = 0; j < 1024; j++) { // 4 KB = 1024 ints
    buffers[i][j]++;
  }
}

The compiler/CPU are pretty cool, and can do some cache prefetching for the inner loop. That's awesome. But...

...I've just eaten up to 40 KB of cache, and kicked out data which the rest of my program enjoyed having in the cache.

It would be cool if I could hint to the compiler or CPU that "I'm not touching this memory again in the foreseeable future, so you can reuse these cache lines":

int* buffers[10] = ...;
for (int i = 0; i < 10; i++) {
  for (int j = 0; j < 1024; j++) {
    buffers[i][j]++;
  }
  // Unfetch the entire 4 KB buffer
  cpu_cache_unfetch(buffers[i], 4096);
}

cpu_cache_unfetch would conceptually "doom" every cache line in that range, making those lines the first to be thrown away.

In the end, this would mean that my little snippet of code uses 4 KB of cache instead of 40 KB, reusing the same 4 KB ten times over. The rest of the program would appreciate that very much.

Would this even make sense? If so, is there a way to do this?

Also appreciated: let me know all the ways I've shown myself to fundamentally misunderstand caching! =D


Solution

  • I only know the answer for x86. This is definitely architecture-specific; different ISAs have different cache-control features.


    On x86, yes: clflush / clflushopt. But each one evicts only a single cache line per execution, so flushing a 4 KB buffer takes 64 of them. (They force write-back + eviction, like you'd need for memory-mapped non-volatile storage.) My understanding is that clflushopt is usually not worth it for a case like this, vs. just allowing the cache pollution to happen.
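
    For illustration, here's a minimal sketch of how the question's hypothetical cpu_cache_unfetch could be built on top of the clflushopt intrinsic. The helper name and the hard-coded 64-byte line size are assumptions; real code would query the line size (e.g. via CPUID), and this needs a CPU with CLFLUSHOPT (compile with -mclflushopt):

    #include <immintrin.h> // _mm_clflushopt, _mm_sfence
    #include <stddef.h>
    #include <stdint.h>

    // Flush every cache line covering [p, p + len), assuming 64-byte lines.
    static void cpu_cache_unfetch(void *p, size_t len) {
        uintptr_t line = (uintptr_t)p & ~(uintptr_t)63; // round down to line start
        uintptr_t end  = (uintptr_t)p + len;
        for (; line < end; line += 64)
            _mm_clflushopt((void *)line);
        _mm_sfence(); // clflushopt is weakly ordered; fence if ordering matters
    }

    Note that this is 64 flush instructions per 4 KB buffer, each forcing write-back of a dirty line, which is exactly why it can easily lose to just letting the pollution happen.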


    In theory there are possible speedups from using NT prefetch for read-only access, but that's brittle: tuning the software-prefetch distance depends on the hardware, and getting it wrong can hurt a lot. And doing a regular store would probably undo the effects of an NT prefetch anyway, leaving the line in the most-recently-used position in L1, L2, and L3.
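
    Since a store defeats the point, this only applies to a purely read-only traversal, which isn't what the question's loop does. As an illustrative sketch (the 8-lines-ahead prefetch distance is a guess, and the function assumes n is a multiple of 16):

    #include <xmmintrin.h> // _mm_prefetch, _MM_HINT_NTA

    // Read-only sum with non-temporal prefetch hints. PREFETCHNTA never
    // faults, so prefetching past the end of the buffer is harmless.
    static long sum_with_nta(const int *buf, int n) {
        long sum = 0;
        for (int j = 0; j < n; j += 16) { // 16 ints = one 64-byte line
            _mm_prefetch((const char *)&buf[j + 128], _MM_HINT_NTA);
            for (int k = 0; k < 16; k++)
                sum += buf[j + k]; // any store here would undo the NT hint
        }
        return sum;
    }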


    One possibly-crazy approach would be NT stores: load a whole cache line of data (four 16-byte vectors = 64 bytes), increment it, then store the updated values with movntdq.

    NT means "non-temporal": for use when the data will not be referenced again in the near future (even by another core). The answers to What is the meaning of "non temporal" memory accesses in x86 are pretty generic, but may help.


    According to Intel's manual, NT stores evict the destination cache line if it was previously cached (What happens with a non-temporal store if the data is already in cache?), so this would work for your use-case. But the compiler would have to be sure to reach a 64-byte alignment boundary before the vectorized inner loop, so it always reads whole cache lines; otherwise it could read 32 bytes of one line and 32 bytes of the next, and evict a line with an NT store before reading that line's last 32 bytes. (Pointer math is easy in asm, though, and compilers do know how to go scalar until an alignment boundary.)

    The normal use-case for NT stores is write-only destination buffers, to avoid the MESI RFO (read-for-ownership) overhead, but this read-modify-write case is at least possibly a win. A sketch is below.
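
    Roughly what that inner loop could look like with SSE2 intrinsics; a sketch, not tuned code. It assumes buf is 64-byte aligned and n is a multiple of 16 ints, i.e. the alignment condition above has already been handled:

    #include <emmintrin.h> // SSE2: _mm_load_si128, _mm_stream_si128 (movntdq)
    #include <stddef.h>

    static void increment_nt(int *buf, size_t n) {
        const __m128i one = _mm_set1_epi32(1);
        for (size_t j = 0; j < n; j += 16) {
            // Read the whole 64-byte line before any NT store can evict it.
            __m128i v0 = _mm_load_si128((__m128i *)&buf[j]);
            __m128i v1 = _mm_load_si128((__m128i *)&buf[j + 4]);
            __m128i v2 = _mm_load_si128((__m128i *)&buf[j + 8]);
            __m128i v3 = _mm_load_si128((__m128i *)&buf[j + 12]);
            // movntdq the incremented values; the line doesn't stay cached.
            _mm_stream_si128((__m128i *)&buf[j],      _mm_add_epi32(v0, one));
            _mm_stream_si128((__m128i *)&buf[j + 4],  _mm_add_epi32(v1, one));
            _mm_stream_si128((__m128i *)&buf[j + 8],  _mm_add_epi32(v2, one));
            _mm_stream_si128((__m128i *)&buf[j + 12], _mm_add_epi32(v3, one));
        }
        _mm_sfence(); // NT stores are weakly ordered; fence before anything relies on the data
    }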


    See the discussion in the comments/chat: this might perform significantly worse. Definitely benchmark both ways before doing this, preferably on a variety of hardware, including multi-socket systems.

    It's also almost certainly worse if the array was hot in cache to start with. I was assuming that this loop was the only thing touching the data, rather than the last step in a chain of modifications.