Say I am storing a linked list of objects, each a struct of size 64 bytes, which is also the size of my cache line. I'll be doing latency-sensitive adds, deletes, and iterations on the linked list over time. I understand that performance is dominated by whether the objects are kept in the cache, so that an access costs ~1 ns instead of the >50 ns of a trip to RAM.
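For concreteness, the layout I have in mind looks something like this (a C++ sketch; the field names are just placeholders):

```cpp
#include <cstdint>

// One node occupies exactly one cache line (assuming 64-byte lines).
struct alignas(64) Node {
    Node*    next;        // 8 bytes on a 64-bit platform
    uint64_t payload[7];  // pads the struct out to 64 bytes total
};
static_assert(sizeof(Node) == 64, "Node should fill one cache line");
```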
It is generally recommended to accomplish this with spatial locality, ideally storing the objects in a contiguous block of memory. This is useful because whenever I access a memory address, the processor actually pulls in a cache line's worth of data; we want this additional data to be other useful objects so we put all our objects in a contiguous block.
I may be misunderstanding, but it seems that we get no benefit from this effect if the object size >= the cache line size: each cache-line fill brings in exactly one object and nothing else.
The other factor to consider, besides the benefit of pre-loading subsequent items when the data size is less than the cache line size, is associativity and mapping. With a linked list you have no contiguous layout (or at least no guarantee of one), so the node addresses are much more likely to collide in the same cache sets than data laid out with spatial locality. On top of that, you do risk some level of memory fragmentation with the linked-list model, although I'm not sure whether that's something you should even worry about.
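If you do go with the linked list, one common mitigation for both problems is to carve the nodes out of one contiguous arena instead of allocating each one individually. A minimal sketch (the `NodeArena` name and the fixed capacity are just for illustration; this assumes the node type is default-constructible):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand all nodes out of one contiguous block so neighbouring nodes tend
// to share pages (and, for smaller structs, cache lines), and so freed
// slots get recycled instead of fragmenting the heap.
template <typename Node>
class NodeArena {
public:
    explicit NodeArena(std::size_t capacity) { pool_.reserve(capacity); }

    Node* allocate() {
        if (!free_list_.empty()) {           // reuse a freed slot first
            Node* n = free_list_.back();
            free_list_.pop_back();
            return n;
        }
        // Exceeding the reserved capacity would reallocate the vector and
        // invalidate every pointer we've handed out, so treat it as an error.
        assert(pool_.size() < pool_.capacity());
        pool_.emplace_back();
        return &pool_.back();
    }

    void release(Node* n) { free_list_.push_back(n); }

private:
    std::vector<Node>  pool_;       // contiguous backing storage
    std::vector<Node*> free_list_;  // recycled slots, reused LIFO
};
```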
Depending on the usage, access patterns, etc., for what you're doing, it's definitely worth weighing the relative benefits of algorithmic efficiency (deletes are very cheap in a linked list, expensive in an array or similar). If you're doing a lot of deleting/inserting, the benefits of algorithmic efficiency could far outweigh any benefits from cache locality.
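To make that asymmetry concrete: removing an element you already have a handle to is O(1) pointer surgery in a linked list but an O(n) shift in an array. A toy illustration with the standard containers:

```cpp
#include <cstddef>
#include <list>
#include <vector>

void erase_from_each(std::list<int>& lst, std::list<int>::iterator it,
                     std::vector<int>& vec, std::size_t idx) {
    lst.erase(it);                 // O(1): relink two pointers
    vec.erase(vec.begin() + idx);  // O(n): shifts every later element down
}
```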
To clarify the associativity concept: the associativity of a cache dictates how many locations in the cache a specific address can map to. Different cache levels will have different associativity; in most cases the L1 cache is 2- or 4-way set associative, meaning any address can map to one of two (or four) locations in the cache, while L2 and L3 are more likely to be 8-way (or sometimes 12- or 16-way) associative.
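Concretely, the set an address maps to is usually just a slice of its bits. A sketch, assuming 64-byte lines and a plain modulo index (real hardware varies, and some designs hash the index bits):

```cpp
#include <cstddef>
#include <cstdint>

// For a cache with 64-byte lines, the set index is the block number
// modulo the number of sets. Any two addresses whose block numbers
// differ by a multiple of num_sets compete for the same handful of
// ways -- that's the collision risk for scattered linked-list nodes.
std::size_t cache_set(std::uintptr_t addr, std::size_t num_sets) {
    std::uintptr_t block = addr / 64;  // strip the 6 line-offset bits
    return static_cast<std::size_t>(block % num_sets);
}
```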
As far as the specific associativity of Intel/AMD/etc. CPUs goes, that's a tougher call, since even Intel folks have a hard time coming up with a good answer! One example I found was the Xeon X5660, which is 4-way set associative for instructions in L1, 8-way set associative for L1 data, 8-way in L2, and 16-way in L3.
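If you'd rather query the machine you're actually running on than dig through spec sheets, Linux exposes these numbers through sysfs. A small sketch, assuming the standard sysfs cache-topology files are present:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Print level, type, and associativity for each cache the kernel
// reports for cpu0 (Linux sysfs cache topology).
int main() {
    for (int i = 0; ; ++i) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" +
                           std::to_string(i) + "/";
        std::ifstream level(base + "level"), type(base + "type"),
                      ways(base + "ways_of_associativity");
        if (!level) break;  // no more cache indices
        std::string l, t, w;
        std::getline(level, l);
        std::getline(type, t);
        std::getline(ways, w);
        std::cout << "L" << l << " " << t << ": " << w << "-way\n";
    }
    return 0;
}
```

On glibc systems, `getconf -a | grep CACHE` reports similar numbers from the shell, where supported.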
The algorithms used by modern CPUs for cache replacement, prefetching, etc., are pretty darned amazing, and go beyond just the basics outlined here, so I think in practice you'd find very little impact from this.