Tags: c++, performance, optimization, stl

Why is std::list<T> so much slower than vector<T*>?


The other day, I was playing with this quickbench: https://quick-bench.com/q/fwF5nkc4ORmYwSXSoVy2d-ekN5U

Basically, the code just measures the time it takes to iterate and sum over different cases: std::vector<int> (reference point), std::vector<Foo>, std::vector<Foo*>, std::list<Foo>, and a custom bare-bones list implementation. The Foo struct is just a value at the beginning, followed by some garbage to pad it to the size of a cache line.

struct Foo {
  int value;
  int garbage[15];
};
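
Simplified, the measured loops boil down to something like the sketch below (my reconstruction of the quick-bench setup, not the exact benchmark code; the function names are mine):

#include <list>
#include <vector>

long sum_vector_of_ptrs(const std::vector<Foo*>& v) {
  long sum = 0;
  for (const Foo* p : v) sum += p->value;  // pointers are contiguous, targets are scattered
  return sum;
}

long sum_list(const std::list<Foo>& l) {
  long sum = 0;
  for (const Foo& f : l) sum += f.value;   // every step chases a next pointer
  return sum;
}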

The results look like the following: [benchmark results chart]

While I understand the difference between the reference case (std::vector<int>) and the other ones, I clearly don't get why the list cases (especially std::list...) are so slow compared to the std::vector<Foo*> case.

Why is it so much slower to follow random pointers than to dereference the same random pointers, with the main difference (I think) being that the addresses are stored contiguously in memory (even though the addresses themselves should be as random as the list ones)?


Solution

  • Why is it so much slower to follow random pointers

    Because accessing randomly distributed memory is about the slowest thing you can do in a computer program.

    That's just it; your processor gets a memory address, needs to fetch that memory, and fetching memory simply takes a lot of time. Due to the sequential nature of a list, your CPU can't do anything about the next node while the current node is still being fetched, so linked list data structures are quite close to a worst-case access pattern in terms of memory latency and cache usage, and also in terms of keeping the CPU pipeline fed.
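
    A minimal sketch of such a traversal (the node layout is my assumption, matching a typical singly linked list) makes the dependency chain visible:

    struct Node {
      Foo data;
      Node* next;
    };

    long sum_custom_list(const Node* head) {
      long sum = 0;
      for (const Node* n = head; n != nullptr; n = n->next) {
        // The address of the next node is stored inside this node, so the
        // next load cannot even be issued until this node has arrived from
        // memory: the loop pays one full memory latency per element.
        sum += n->data.value;
      }
      return sum;
    }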

  • with the main difference (I think) being that the addresses are stored contiguously in memory

    CPUs (and compilers, too) can reorder operations when the next step doesn't inherently depend on the result of the previous one. So your CPU can hide a lot of memory latency by already starting to look up the next memory location (potentially several of them, actually) while it's still waiting for the first fetch to complete.
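
    That is exactly what the std::vector<Foo*> loop allows, as in the sketch below (illustrative code, not the exact benchmark): the addresses themselves sit in a contiguous array, so the CPU knows where the next few elements live without waiting for any of them.

    #include <cstddef>
    #include <vector>

    long sum_vector_of_ptrs_indexed(const std::vector<Foo*>& v) {
      long sum = 0;
      for (std::size_t i = 0; i < v.size(); ++i) {
        // v[i], v[i+1], v[i+2], ... are all available up front, so the
        // out-of-order core can issue several v[i]->value loads at once
        // and overlap their memory latencies.
        sum += v[i]->value;
      }
      return sum;
    }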

    With a list, that's impossible, because the next memory location to fetch isn't known until the currently requested memory has actually arrived and the pointer to the next node can be read from it.

    The raw std::vector<int> of course has no such problems, and by virtue of the values being contiguous in memory, you also minimize memory bandwidth: you only fetch the data you actually need, and you use all of each cache line (typically 64 B, which is the granularity at which memory is fetched). Bonus points for that fetching also happening in order: that makes it possible for the compiler to load the memory directly into SIMD registers and do multiple additions at once (if your processor has AVX2, that's 8 additions in one instruction; with AVX512, up to 16 additions per instruction). If you have an x86 PC, you almost certainly have AVX2; if you have a modern AMD CPU, AVX512 will be blazingly fast.
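
    Made explicit with AVX2 intrinsics, that vectorized sum looks roughly like the sketch below (compilers generate something equivalent on their own at -O2/-O3 with AVX2 enabled; the hand-written version is just to make the 8-at-a-time addition visible):

    #include <immintrin.h>
    #include <cstddef>
    #include <vector>

    long sum_vector_int_avx2(const std::vector<int>& v) {
      __m256i acc = _mm256_setzero_si256();
      std::size_t i = 0;
      for (; i + 8 <= v.size(); i += 8) {
        // One load grabs 8 contiguous ints (half a cache line);
        // one instruction adds all 8 lanes into the accumulator.
        __m256i chunk = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(v.data() + i));
        acc = _mm256_add_epi32(acc, chunk);
      }
      // Horizontal sum of the 8 lanes, then the scalar tail.
      alignas(32) int lanes[8];
      _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
      long sum = 0;
      for (int lane : lanes) sum += lane;
      for (; i < v.size(); ++i) sum += v[i];
      return sum;
    }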

    Why your custom list is faster than std::list, I don't really know, but my guess would be extra sanity checks. You might want to build (optimized, but with debug symbols) and run your code locally on your machine, and profile it (e.g. with perf if you're on Linux).