Tags: c++, memory, data-structures, alignment

C++: Align 5-byte structs to cache lines


I'm developing a CPU-FPGA co-processing framework, so I need complete control over the alignment of my data. I have a data structure that only requires 5 bytes:

#include <cstdint>

typedef struct __attribute__ ((packed)) {
    uint32_t dst;    // 4 bytes
    uint8_t weight;  // 1 byte -> 5 bytes total when packed
} edg_t;

My FPGA interface can read one cache line (64 bytes) per cycle (200 million reads/second). It is critical to my performance that I cram as many elements as possible into one cache line, so padding the struct is out of the question:

5 bytes : 12 elements / read
8 bytes : 8 elements / read (padded)
padding -> 1.5x performance reduction
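
To make the arithmetic concrete, here is a compile-time check (a sketch; edg_unpacked_t is a hypothetical variant added only for comparison, and its size of 8 assumes the usual 4-byte alignment of uint32_t):

typedef struct {      // hypothetical: same fields, default padding
    uint32_t dst;
    uint8_t weight;
} edg_unpacked_t;

static_assert(sizeof(edg_t) == 5, "packed: 12 elements fit per 64-byte line");
static_assert(sizeof(edg_unpacked_t) == 8, "padded: only 8 elements fit per line");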

However, I cannot have a struct straddle the boundary between two cache lines, as that would require logic on the FPGA to constantly shift the read data.

My current solution when building the buffer looks like this:

int num_elements = 1000;
int num_cachelines = num_elements / 12 + 1; // 12 five-byte elements per 64-byte line

uint8_t* buffer = new uint8_t[num_cachelines * 64];
uint8_t* buf_ptr = buffer - 4;

for (int i = 0; i < num_elements; i++) {
    if (i % 12 == 0) buf_ptr += 4; // skip the last 4 bytes of each cache-line

    edg_t* edg_ptr = (edg_t*) buf_ptr;
    edg_ptr->dst = i;          // example; I have random generators here
    edg_ptr->weight = i % 256;
    buf_ptr += sizeof(edg_t);  // advance by the 5-byte element size
}

That was fine while the FPGA was doing all the work on its own, but now I want the FPGA and the CPU to cooperate, which means the CPU has to read the buffer as well.

I was wondering: is there a better way to have the compiler handle the padding automatically, or will I have to manually skip the bytes every time, as I did in the buffer-creation code above?
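
For reference, a CPU-side read of this layout has to mirror the same skipping by hand (a sketch; consume() stands in for whatever the CPU does with each element):

const uint8_t* read_ptr = buffer - 4;
for (int i = 0; i < num_elements; i++) {
    if (i % 12 == 0) read_ptr += 4; // skip the 4 unused tail bytes of each line
    const edg_t* edg_ptr = (const edg_t*) read_ptr;
    consume(edg_ptr->dst, edg_ptr->weight); // hypothetical consumer
    read_ptr += sizeof(edg_t);
}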


Solution

  • I'm assuming that you'll create this buffer structure once and then populate it over and over again for the FPGA to read (or vice versa). If so, this layout should work:

    #include <array>    // std::array
    #include <cstddef>  // size_t
    #include <cstdint>  // uint32_t, uint8_t
    #include <iostream> // std::cout

    constexpr size_t cacheline_size = 64;
    constexpr size_t num_elements = 1000;
    
    struct __attribute__ ((packed)) edg_t  {
        /*volatile*/ uint32_t dst;   // volatile if the FPGA writes too
        /*volatile*/ uint8_t weight;
    };
    
    constexpr size_t elements_per_cacheline = cacheline_size / sizeof(edg_t); // 64/5 = 12
    constexpr size_t num_cachelines = num_elements / elements_per_cacheline + 1; // round up
    
    struct alignas(cacheline_size) cacheline_t {
        std::array<edg_t, elements_per_cacheline> edg;
        inline auto begin() { return edg.begin(); }
        inline auto end() { return edg.end(); }
    };
    
    struct cacheline_collection_t {
        std::array<cacheline_t, num_cachelines> cl;
        inline void* address_for_fpga() { return this; }
        inline auto begin() { return cl.begin(); }
        inline auto end() { return cl.end(); }
    };
    
    int main() {
        cacheline_collection_t clc{}; // value-initialize so the demo loop reads defined values
        std::cout << "edg_t                 : "
           << alignof(edg_t) << " " << sizeof(clc.cl[0].edg[0]) << "\n";
        std::cout << "cacheline_t           : "
           << alignof(cacheline_t) << " " << sizeof(clc.cl[0]) << "\n";
        std::cout << "cacheline_collection_t: "
           << alignof(cacheline_collection_t) << " " << sizeof(clc) << "\n";
    
        // access
        for(auto& cl : clc) {
            for(auto& edg : cl) {
                std::cout << edg.dst << " " << (unsigned)edg.weight << "\n";
            }
        }
    }
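
    Populating the buffer then needs no manual pointer arithmetic at all; the 4 bytes of tail padding per line fall out of the alignas(64). A minimal sketch of a fill loop against the structures above (value generation copied from the question's example):

        for (size_t i = 0; i < num_elements; i++) {
            // line i/12, slot i%12; the padding between lines is implicit
            auto& e = clc.cl[i / elements_per_cacheline].edg[i % elements_per_cacheline];
            e.dst    = static_cast<uint32_t>(i);
            e.weight = static_cast<uint8_t>(i % 256);
        }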
    

    The assembly @ godbolt looks nice. The inner loop has been fully unrolled into 12 code blocks, each adding 5 to the rax offset. It then advances to the next cache line (conditionally) in 3 instructions:

        add     rax, 64
        cmp     rax, rcx
        jne     .LBB0_1
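
    To guard these layout assumptions against accidental changes, a few compile-time checks can sit next to the type definitions (they all hold for the code above):

        static_assert(sizeof(edg_t) == 5, "edg_t must stay packed at 5 bytes");
        static_assert(sizeof(cacheline_t) == cacheline_size, "each line must pad out to exactly 64 bytes");
        static_assert(alignof(cacheline_t) == cacheline_size, "no edg_t may straddle a cache-line boundary");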