Tags: caching, data-structures, linked-list, unroll

Optimal block size for unrolled linked lists


I'm learning about basic data structures and so far have gotten to unrolled linked lists. A book I have says that if I size each block to hold at most one cache line's worth of elements, I will get better cache performance from the improved memory locality. I have two questions about this.

First, is it optimal to make a block exactly the size of a cache line, or is any smaller size that divides a cache line evenly just as good?

Second, I found in this post that the line size for the L1/L2/L3 caches is 64 bytes. I just wanted to make sure that this holds for all i7 models? I have a mid-2014 MBP and am trying to create an unrolled linked list that is optimal for my system. Is there a terminal command to check the cache line size?


Solution

  • The elements in a node of an unrolled linked list are all accessed very fast1.
    The bytes in a cache line are all accessed very fast.

    We can see the analogy here: unrolled linked lists pack the items into a contiguous area of memory so that they are more cache friendly.

    To see why having a node bigger than a cache line can be a problem, consider an architecture with a cache (of any associativity) that has only one line, of size S.
    Consider also an unrolled linked list with node size 2S.
    Finally, let's analyze the cache misses of this algorithm:

    For each node N
      Let avg = ArithmeticMean(N.items)
      For i = 0 To N.numberOfItems - 1
         N.items[i] = avg
    

    This sets the value of each item in a node (assume a full node) to the arithmetic mean of the node's items.

    To compute the mean, all the elements must be summed. Accessing the first element triggers a cache load (+1); the rest of the first half is then read from the line just loaded.
    As soon as the first element of the second half is accessed, another cache load is required and the old line is evicted (+2). This second load serves all further accesses until the end of the node.
    Once we have the mean, the first half is written, causing another cache load (+3) that evicts the line holding the second half, which must then be reloaded shortly after (+4).

    The algorithm triggers 4 cache loads per node. If we make the node size S and repeat the analysis, we'll see that only one cache load is required.

    Making the node smaller than a cache line also works: some nodes may end up sharing a line, which generally doesn't hurt. However, this uses more lines relative to the total number of elements in the list, since each node sits at its own address and the nodes are not necessarily close together. In the limit, with a node size of one element, we are back to an ordinary linked list.


    So far, all not-so-old Intel CPUs have a 64-byte cache line.
    This can very well change, though.

    To see your CPU's cache info you can refer to this question: finding L2 cache size in Linux2.

    It boils down to using sudo dmidecode -t cache.
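Since you're on a MBP: both Linux and macOS also expose the line size directly through standard utilities. These are real commands, though the reported value (typically 64) depends on your hardware.

```shell
# Print the L1 data cache line size in bytes on either OS.
case "$(uname -s)" in
  Linux)  getconf LEVEL1_DCACHE_LINESIZE ;;  # glibc getconf variable
  Darwin) sysctl -n hw.cachelinesize ;;      # macOS hardware sysctl
esac
```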


    1 Thanks to the fact that an array is used to store the elements, allowing random access.

    2 For all cache levels, in fact.