algorithm data-structures stack cpu cpu-cache

How to store items in the LIFO stack in a cache-friendly manner?

EDIT / DISCLAIMER:

It appears that my understanding of cache lines was wrong which makes this question not relevant and misleading. I thought whenever CPU tries to fetch a memory at a specific index, it also grabs indexes right after the first one which is NOT true. See How do cache lines work? for reference.

I was about to write a data container for storing a continuous and re-sizable memory block, in which items would only be possible to access from one side by pushing or popping - basically a LIFO stack. I've been reading a lot about CPU cache and memory access patterns and wanted said container to be as cache-friendly as possible.

So I've taken a look at common implementations of LIFO stacks on the internet. Most of them suggest using a dynamic array as a base and accessing the data by appending or removing items from the end of the array. In this case the empty capacity gap is stored after the data in the following way:

 stack bottom            stack top
 array begin             V  array end
 V                       V  V
|V          data         V |V     gap     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|8 |7 |6 |5 |4 |3 |2 |1 |0 |  |  |  |  |  |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
                        |       cache line       |

However, this does not seem very cache-friendly to me. Whenever one looks up the item on the top of the stack, CPU would fetch the memory from gap containing garbage, or at least this is how I understand it.

Would the approach where gap appears as the begging of memory block and the data at the end be more performent? In this case the indexes of the items would be reversed, same with bottom and top. In this case the cache line could hit more items like so:

                stack top               stack bottom
                V                       V
|      gap     |V          data         V |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  |  |  |  |  |0 |1 |2 |3 |4 |5 |6 |7 |8 |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
               |       cache line      |

Is my understanding correct here? Which approach is more performent and cache-friendly?

I could be totally wrong with my assumptions. As I said most of the implementations that I saw use the 1st approach so there must be something to it.

Cheers!

Solution

LIFO stacks using an array are equally cache-friendly for either growth direction, unless you have a specific pattern of sizes that happens to favour an odd-sized allocation and growing down.

APIs for efficient reallocation (without copying the data) like Linux mremap or C realloc are designed around adding new space at the end of an allocation, not the start, so that's a point in favour of the normal way if your stack can grow beyond the initial allocation. At least if you're using an allocation API that supports those, not like C++'s crippled new/delete which essentially forces std::vector to suck.

Other than that, they're equal unless your high water mark is commonly 1 or 2 elements more than a multiple of 64 bytes, like you've shown here to make your other version look good. In that version, the first element pushed is in a cache line by itself, but the top-of-stack hasn't yet crossed into a new cache line.

If you push one more element in your second diagram, it will be the only in-use element in its cache line. It only looks good because you picked a number of elements for the current stack state that happens to work well with the misalignment (relative to cache-line boundaries) you chose for the first element (the "bottom" of the stack, at the highest address for your grows-down stack).

With your grows-down gap-at-the-front design, potentially the first element pushed (highest address) is in the same cache line as some other useful data, so that cache-line staying hot can be helpful (if your stack frequently becomes empty so you actually access that element).
Memory allocators often round up allocations to multiples of some larger size like 16 bytes, but there could still be other allocations in one cache line.
Still, unless you know something special about your expected stack sizes and/or what comes next, normally you'd want to allocate a chunk of memory that's a multiple of the cache-line size.

Fun fact: the asm callstack grows down on mainstream ISAs (and calling conventions for ISAs like MIPS which don't favour one stack-growth direction). e.g. x86 push and call decrement the stack pointer register (RSP). The reasons for this aren't cache efficiency, more historical, dating back to before CPUs had caches. It doesn't present any particular problems for CPU caches, though.

CPU/asm stacks are allocated such that the highest-address cache line is fully used, unlike yours where the end of the allocation isn't the end of a cache line. (Typically "the stack" is a multiple of the page size, so there won't be other stuff next to it in the same line anyway.)