I'm developing a CPU-FPGA co-processing framework, so I need complete control over the alignment of my data. I have a data structure that only requires 5 bytes:
typedef struct __attribute__((packed)) {
    uint32_t dst;
    uint8_t weight;
} edg_t;
My FPGA interface can read 1 cache-line (64 bytes) per cycle (200 million reads / second). It is critical to my performance that I cram as many elements as possible into one cache-line, so padding the struct is out of the question:
5 bytes : 12 elements / read
8 bytes : 8 elements / read (padded)
padding -> 1.5x performance reduction
However, I cannot have a struct straddle the boundary between two cache-lines: that would require logic on the FPGA to constantly shift the read data.
My current solution when building the buffer looks like this:
int num_elements = 1000;
int num_cachelines = (num_elements + 11) / 12;  // round up to whole cache-lines
uint8_t* buffer = new uint8_t[num_cachelines * 64];
uint8_t* buf_ptr = buffer - 4;
for (int i = 0; i < num_elements; i++) {
    if (i % 12 == 0) buf_ptr += 4; // skip the last 4 bytes of each cache-line
    edg_t* edg_ptr = (edg_t*) buf_ptr;
    edg_ptr->dst = i; // example, I have random generators here
    edg_ptr->weight = i % 256;
    buf_ptr += 5;     // advance by sizeof(edg_t)
}
That was fine while the FPGA was doing all the work on its own, but now I want the FPGA and the CPU to cooperate, which means the CPU has to read the buffer as well.
I was wondering if there exists a better way to have the compiler handle the padding automatically, or will I have to manually skip the bytes every time, as in the buffer-creation code above?
I'm assuming that you'll create this buffer structure once and then populate it over and over again for the FPGA to read (or vice versa). If so, this layout should work:
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

constexpr size_t cacheline_size = 64;
constexpr size_t num_elements = 1000;

struct __attribute__((packed)) edg_t {
    /*volatile*/ uint32_t dst;   // volatile if the FPGA writes too
    /*volatile*/ uint8_t weight;
};

constexpr size_t elements_per_cacheline = cacheline_size / sizeof(edg_t);
constexpr size_t num_cachelines =
    (num_elements + elements_per_cacheline - 1) / elements_per_cacheline; // round up

struct alignas(cacheline_size) cacheline_t {
    std::array<edg_t, elements_per_cacheline> edg;
    inline auto begin() { return edg.begin(); }
    inline auto end() { return edg.end(); }
};

struct cacheline_collection_t {
    std::array<cacheline_t, num_cachelines> cl;
    inline void* address_for_fpga() { return this; }
    inline auto begin() { return cl.begin(); }
    inline auto end() { return cl.end(); }
};
int main() {
    cacheline_collection_t clc;
    std::cout << "edg_t                 : "
              << alignof(edg_t) << " " << sizeof(clc.cl[0].edg[0]) << "\n";
    std::cout << "cacheline_t           : "
              << alignof(cacheline_t) << " " << sizeof(clc.cl[0]) << "\n";
    std::cout << "cacheline_collection_t: "
              << alignof(cacheline_collection_t) << " " << sizeof(clc) << "\n";

    // access
    for (auto& cl : clc) {
        for (auto& edg : cl) {
            std::cout << edg.dst << " " << (unsigned) edg.weight << "\n";
        }
    }
}
The assembly @ godbolt looks nice: the inner loop has been fully unrolled into 12 blocks, each advancing the rax offset by 5. It then moves to the next cache-line (conditionally) in 3 instructions:
add rax, 64
cmp rax, rcx
jne .LBB0_1