Tags: c, caching, cpu, cpu-architecture, cpu-cache

Why can pointer chasing in a doubly-linked list avoid cache thrashing (self-eviction)?


I was trying to understand this paper about cache timing issues.

In Section 3.6, the authors explain a technique that lets you populate a contiguous cache region and measure how long that population takes. They mention:

"A naive implementation of the prime and probe steps (i.e., scanning the memory buffer in fixed strides) gives poor results due to two optimizations implemented in modern CPUs: reordering of memory accesses, and automatic read-ahead of memory by the “hardware prefetcher”. Our attack code works around both disruptions by using the following “pointer chasing” technique. During initialization, the attacker’s memory is organized into a linked list (optionally, randomly permuted); later, priming and probing are done by traversing this list (see Figure 7). To minimize cache thrashing (self-eviction), we use a doubly-linked list and traverse it forward for priming but backward for probing. Moreover, to avoid “polluting” its own samples, the probe code stores each obtained sample into the same cache set it has just finished measuring. On some platforms one can improve the timing gap by using writes instead of reads, or more than W reads."

I have a question about this paragraph:

How does a linked list avoid the timing variation caused by hardware prefetching and memory-access reordering?


Solution

  • Currently implemented hardware prefetchers only handle fixed-stride access patterns, so accesses that are ordered arbitrarily with respect to memory address are not recognized as prefetchable. Such prefetchers could still detect and exploit accidental stride patterns. E.g., if address[X+5] = address[X] + N and address[X+7] = address[X+5] + 2N (note that the addresses do not have to be separated by a constant access count), a prefetcher might predict a stride of N. This could modestly reduce apparent miss latency (some prefetches would be valid) or introduce some cache pollution.

    (If the detected/predicted N is small and the linked list is contained within a modest region of memory relative to cache size (so that most of the stride prefetches will be to addresses that will be used soonish), stride-based prefetching might have a noticeable positive effect.)

    By using data-dependent accesses (traversing a link requires loading the node's data), access reordering is prevented: the hardware cannot begin loading the next node until the next pointer from the previous node is available (see the sketch after this list).

    There have been academic proposals for supporting prefetching of such patterns and even generic correlations (e.g., search Google Scholar for Markov prefetching), but to my knowledge none have yet been implemented in commercial hardware.

    As a side note, traversing the probe accesses in the reverse order of the priming accesses avoids excessive misses under common, LRU-oriented replacement policies: the probe touches the most recently primed lines first, so misses tend to evict lines that have already been probed rather than lines still waiting to be measured.
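
To make the pointer-chasing idea concrete, here is a minimal C sketch of the prime/probe traversal. This is my own illustration, not the paper's code: it assumes x86-64, 64-byte cache lines, and a GCC/Clang toolchain (`__rdtsc` comes from x86intrin.h), and the buffer size and permutation are arbitrary. A real attack would additionally map the nodes onto specific cache sets, time each set individually, and serialize the timestamp reads.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define LINE  64                      /* assumed cache-line size      */
    #define NODES 4096                    /* one node per cache line      */

    struct node {
        struct node *next;                /* forward link: prime order    */
        struct node *prev;                /* backward link: probe order   */
        char pad[LINE - 2 * sizeof(struct node *)];
    };

    static struct node buf[NODES] __attribute__((aligned(LINE)));

    int main(void)
    {
        /* Randomly permute the node order so consecutive accesses have no
           fixed stride for a hardware prefetcher to lock onto. */
        size_t perm[NODES];
        for (size_t i = 0; i < NODES; i++) perm[i] = i;
        for (size_t i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }

        /* Build the doubly-linked list in permuted order. */
        for (size_t i = 0; i + 1 < NODES; i++) {
            buf[perm[i]].next     = &buf[perm[i + 1]];
            buf[perm[i + 1]].prev = &buf[perm[i]];
        }
        buf[perm[0]].prev         = NULL;
        buf[perm[NODES - 1]].next = NULL;

        /* Prime: walk forward.  Every load depends on the previous one, so
           the CPU cannot reorder the accesses or fetch a node early. */
        struct node *p = &buf[perm[0]];
        while (p->next) p = p->next;      /* p ends at the last node      */

        /* Probe: walk backward and time it.  Under LRU-like replacement
           the most recently primed lines are probed first, which limits
           self-eviction. */
        uint64_t t0 = __rdtsc();
        while (p->prev) p = p->prev;
        uint64_t t1 = __rdtsc();

        /* Printing p keeps the dependent loads from being optimized away. */
        printf("probe: %llu cycles (ended at %p)\n",
               (unsigned long long)(t1 - t0), (void *)p);
        return 0;
    }

Because every access is a load of the pointer just fetched, the traversal forms a single serial dependence chain, and the permuted order has no fixed stride; together these defeat both out-of-order reordering and stride prefetching, which is the behavior the paper's paragraph describes.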