Tags: c, performance, caching, benchmarking, cpu-cache

Confused about CPU cache benchmark results


Section 3.3.2 of Ulrich Drepper's "What Every Programmer Should Know About Memory" describes a simple benchmark demonstrating the effects of caching on sequential reads. The benchmark consists of following a (sequentially laid out) circular linked list and measuring the average time spent per node. The node struct looks like this:

struct l {
  struct l *n;         /* pointer to the next node in the cycle */
  long int pad[NPAD];  /* padding that sets the element size (and thus the stride) */
};

A working set of size N contains N/sizeof(l) nodes. With enough padding, each pointer is stored in a different cache line, so we need a cache of size at least 64 * N/sizeof(l) to follow the cycle without expensive memory access.
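For reference, here is roughly the measurement loop I have in mind. This is my reconstruction from the text, not Drepper's actual harness; NPAD = 7, the 2^19-byte working set, and the iteration count are arbitrary illustrative choices, and 8-byte pointers and longs are assumed:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NPAD 7  /* illustrative: with 8-byte longs, sizeof(struct l) == 64 */

struct l {
  struct l *n;
  long int pad[NPAD];
};

int main(void)
{
  size_t nelems = (1u << 19) / sizeof(struct l);  /* 2^19-byte working set */
  struct l *a = malloc(nelems * sizeof *a);
  if (!a) return 1;

  /* Lay the circular list out sequentially in memory. */
  for (size_t i = 0; i < nelems; i++)
    a[i].n = &a[(i + 1) % nelems];

  /* Chase pointers and time the average cost per node.  The data
     dependency on p (printed below) keeps the loop from being
     optimized away. */
  size_t iters = 100000000;
  struct l *p = a;
  clock_t t0 = clock();
  for (size_t i = 0; i < iters; i++)
    p = p->n;
  clock_t t1 = clock();

  printf("p=%p  %.2f ns per node\n", (void *)p,
         (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / iters);
  free(a);
  return 0;
}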

Here's the part that confuses me: Assume that 64 * N/sizeof(l) is exactly the size of the biggest cache. If we, say, halve sizeof(l) by changing the padding, N also needs to be 2x smaller if we want everything to still fit in the cache. So with different strides, we should run out of cache at different working set sizes. But Figure 3.11 (I don't have enough rep to embed images) from the text seems to contradict this - the sudden jump in access time happens at N=2^19 for all three NPAD sizes.

What am I missing?

At first I thought that maybe the padding gets cached too - but that's not true as explained in answers to this question.


Solution

  • With enough padding, each pointer is stored in a different cache line, so we need a cache of size at least 64 * N/sizeof(l) to follow the cycle without expensive memory access.

    No, if a cache line is 64 bytes, each node is in a different cache line if sizeof(struct l) is at least 64. But that is not the main point of section 3.3.2.

    Assume that 64 * N/sizeof(l) is exactly the size of the biggest cache. If we, say, halve sizeof(l) by changing the padding, N also needs to be 2x smaller if we want everything to still fit in the cache.

    The size of the cache is a fixed number of bytes, not a calculation such as 64 * N/sizeof(l). In the part of this subsection you refer to, Drepper uses a 1 MiB (2^20-byte) L2 cache. Additionally, Drepper uses N in describing the “working set size” (apparently the total array size, although not all of the array is accessed in some trials) as 2^N bytes, so the “working set” fits in cache if 2^N bytes is less than or equal to the cache size. If the array size is 2^N bytes, the number of elements must be 2^N / sizeof(struct l).
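    To see why the knee sits at the same byte size for every NPAD, compare the element counts for a fixed working set. A quick check, assuming a 64-bit system with 8-byte pointers and 8-byte long int (the struct names here are mine, not Drepper's):

    #include <stdio.h>

    struct l0  { struct l0 *n; };                 /* NPAD = 0:  sizeof == 8   */
    struct l7  { struct l7 *n; long pad[7]; };    /* NPAD = 7:  sizeof == 64  */
    struct l31 { struct l31 *n; long pad[31]; };  /* NPAD = 31: sizeof == 256 */

    int main(void)
    {
        size_t ws = 1ul << 20;  /* 2^20-byte working set, matching the 1 MiB L2 */
        printf("NPAD=0:  %zu elements\n", ws / sizeof(struct l0));   /* 131072 */
        printf("NPAD=7:  %zu elements\n", ws / sizeof(struct l7));   /*  16384 */
        printf("NPAD=31: %zu elements\n", ws / sizeof(struct l31));  /*   4096 */
        return 0;
    }

    The number of elements differs by a factor of 32 across these paddings, but the total size in bytes, which is what gets compared against the cache size, is the same 2^20 in each case. That is why all the curves jump at the same working-set size.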

    In figure 3.11, what you see is:

    • As long as the total array size is smaller than 2^19 bytes, everything in it fits in L2 cache, and access is “fast.”
    • At 2^19 and 2^20 bytes, the entire array would fit in cache by itself, but there are other things in the system that need cache too, so we see intermediate effects, which I will not discuss in detail here.
    • Beyond 2^20 bytes, the array elements cannot remain in cache when they are all accessed consecutively (even though, with large padding, the actually-accessed bytes total much less than the cache size, the cache associativity is such that they overflow their designated cache sets, effectively the same as overflowing all of cache).
      • For zero padding, prefetching serves well: the hardware observes memory being read consecutively and prefetches cache lines. With zero padding, several elements fit within one cache line, so each prefetch of one cache line provides several elements, and the time per element is low (see the sketch after this list).
      • For small padding, presumably where there is one element per cache line, prefetching operates in the same way: The hardware observes memory is being read consecutively and prefetches cache lines. However, there is only one array element per cache line, so the time per element is higher, even though the time per cache line is the same.
      • For moderate padding and larger padding, prefetching either does not operate or does not serve well.
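    To make the zero-padding case concrete, count how many elements share one cache line. A trivial check, again assuming 8-byte pointers and a 64-byte line:

    #include <stdio.h>

    #define CACHE_LINE 64  /* line size assumed throughout this discussion */

    struct l0 { struct l0 *n; };              /* NPAD = 0: 8 bytes  */
    struct l7 { struct l7 *n; long pad[7]; }; /* NPAD = 7: 64 bytes */

    int main(void)
    {
        /* If a prefetched line costs about the same to fetch either way,
           the per-element time for NPAD=0 is roughly one eighth of the
           per-element time for NPAD=7. */
        printf("NPAD=0: %zu elements per line\n", CACHE_LINE / sizeof(struct l0)); /* 8 */
        printf("NPAD=7: %zu elements per line\n", CACHE_LINE / sizeof(struct l7)); /* 1 */
        return 0;
    }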

    I am skeptical of some of Drepper’s statements. For example, regarding the difference between paddings of 15 and 31 long ints (apparently changing the element size from 128 to 256 bytes), he writes:

    Where the element size is hampering the prefetching efforts is a result of a limitation of hardware prefetching: it cannot cross page boundaries.

    However, with a page size of 4096 bytes (inferred from a mention elsewhere that 60 pages are 245,760 bytes) and element spacings of 128 and 256 bytes, reads would encounter page boundaries only every 32 and 16 elements, respectively. Attributing the jump from about 145 cycles to around 320 cycles per element to the additional 1/32 page transitions per element (1 per 16 elements minus 1 per 32 elements = 1/32 per element) implies 1/32 • C = 320 − 145, where C is the number of cycles a page transition costs. Then C = (320 − 145) • 32 = 5,600. I would not trust this large figure without more targeted tests or supporting documentation.
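    Spelled out as a quick sanity check (the 145 and 320 cycle figures are approximate plateau values read off the graph):

    #include <stdio.h>

    int main(void)
    {
        /* Assumptions from the discussion above: 4096-byte pages and
           element spacings of 128 and 256 bytes. */
        double per_elem_128 = 128.0 / 4096.0;       /* 1 page boundary per 32 elements */
        double per_elem_256 = 256.0 / 4096.0;       /* 1 page boundary per 16 elements */
        double extra = per_elem_256 - per_elem_128; /* 1/32 extra transition per element */

        /* If the extra page transitions alone explained the jump from ~145
           to ~320 cycles per element, each transition would have to cost: */
        double cost = (320.0 - 145.0) / extra;      /* = 5600 cycles */

        printf("extra transitions per element: %g\n", extra);
        printf("implied cost per transition:   %g cycles\n", cost);
        return 0;
    }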

    Drepper also writes, in parenthesized italics:

    Yes, this is a bit inconsistent because in the other tests we count the unused part of the struct in the element size and we could define NPAD so that each element fills a page. In that case the working set sizes would be very different. This is not the point of this test, though, and since prefetching is ineffective anyway this makes little difference.

    This does not match his earlier description of the working set, where he says a working set of 2^N bytes contains 2^N/sizeof(struct l) elements. The earlier description shows his “working set” is the entire array, even if only part of each element is accessed; the parenthesized remark says the working set would vary depending on how much of each element is accessed.

    Overall, I think Drepper is somewhat loose with his descriptions and attributions of cause. Especially without sample code, it is hard to follow some of the test descriptions or to reproduce them. This supposed modeling of array elements of various sizes is actually a modeling of skipping through memory and accessing only parts of it; if those were actual array elements a program was working with, it would load the entire element, and prefetching would serve well per byte used. That does not invalidate Drepper’s measurements once the confusing description is understood, but it is a misleading description. Similarly, the differing descriptions of “working set” in different places are confusing.

    So my advice would be to take what Drepper writes with a grain of salt and not to expect it to be a completely solid and clear specification of cache and memory performance.