Search code examples
c++cachingmemorycpu-architecturememory-alignment

Align struct while minimizing cache line waste


There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).

Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t and one uint8_t). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.

For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 caches lines as we do definitely not cross CL boundaries.

However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.

My question is two-fold:

  1. In C++17 and above, using alignas, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:

enter image description here

where the upper box is a cache line and the other boxes are instances of our small struct.

  1. Do I actually want this? I cannot really wrap my head around this. Usually, we say if we align our struct to the cache line size, access will be faster, as we just have to load a single cache line. However, seeing my example, I wonder if this is actually true. Wouldn't it be faster to not be aligned, but instead store many more instances in a single cache line?

Thank you so much for your input here. It is much appreciated.


Solution

  • To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:

        #include <array>
        #include <iostream>
        
        using namespace std;
        
        template <typename T, size_t BlockAlignment>
        struct __attribute__((packed)) Packer
        {
            static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
            static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
            T &operator[]( size_t index ) { return packed[index]; }
            
            T packed[ NUM_ELEMS ];
            uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
        };
        
        template <typename T, size_t NumElements, size_t BlockAlignment>
        struct alignas(BlockAlignment) PackedAlignedArray
        {
            typedef Packer<T, BlockAlignment> PackerType;
            std::array< PackerType, NumElements / PackerType::NUM_ELEMS + 1 > packers;
            T &operator[]( size_t index ) { 
                return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
            }
        };
        
        struct __attribute__((packed)) Foo
        {
            uint64_t a;
            uint64_t b;
            uint8_t c;
        };
        
        int main()
        {
            static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
            constexpr size_t NUM_ELEMENTS = 10;
            constexpr size_t BLOCK_ALIGNMENT = 64;
    
            PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;
    
            for ( size_t i=0; i<NUM_ELEMENTS; ++i )
            {
                // Display the memory offset between the current
                // element and the start of the array
                cout << reinterpret_cast<std::ptrdiff_t>(&theArray[i]) - 
                    reinterpret_cast<std::ptrdiff_t>(&theArray[0]) << std::endl;
            }
        
            return 0;
        }
    

    The output of the program shows the byte offsets of the addresses in memory of the the 17-byte elements, automatically resetting to a multiple of 64 every four elements:

    0
    17
    34
    64
    81
    98
    128
    145
    162
    192