I have the following loop in asm:
.LBB5_5:
vaddpd ymm0, ymm0, ymmword ptr [rdi + 8*rcx]
vaddpd ymm1, ymm1, ymmword ptr [rdi + 8*rcx + 32]
vaddpd ymm2, ymm2, ymmword ptr [rdi + 8*rcx + 64]
vaddpd ymm3, ymm3, ymmword ptr [rdi + 8*rcx + 96]
add rcx, 16
cmp rax, rcx
jne .LBB5_5
This is part of a bigger function that calculates the sum of an [f64] array in Rust.
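For readers who want something to experiment with, here is a minimal sketch of a sum with 16 independent partial sums (4 vectors of 4 lanes, matching the four ymm accumulators above). The name `sum_unrolled` is hypothetical and this is not the original function; whether the compiler turns it into exactly the loop shown depends on compiler version and flags.

```rust
/// Sums a slice using 16 independent partial sums, so that no single
/// addition chain has to wait on all previous elements.
/// (Hypothetical helper, not the function from the question.)
fn sum_unrolled(data: &[f64]) -> f64 {
    let mut acc = [0.0f64; 16];
    let chunks = data.chunks_exact(16);
    // Elements left over when the length is not a multiple of 16.
    let tail = chunks.remainder();
    for chunk in chunks {
        // Each of the 16 slots only ever depends on its own previous value.
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += *x;
        }
    }
    // Combine the partial sums, then add the leftover elements.
    acc.iter().sum::<f64>() + tail.iter().sum::<f64>()
}
```

Note that this reorders the additions compared to a plain left-to-right sum, so the result can differ in the last bits for general floating-point input.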
I benchmarked this code with the criterion crate and get that 1 000 000 000 doubles take 200 000 000 cycles on my Rocket Lake CPU (i7 11700K).
In various sources I find that the latency of a floating point addition is 4 cycles on this CPU.
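This would mean that each of the four vaddpd instructions can only run every 4th cycle, because each one depends on the previous sum in its own accumulator register. With 4 independent accumulators holding 4 doubles each, that works out to 16 double additions every 4 cycles, so I can do at most 4 double additions per cycle.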
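The bound above can be written out as a small calculation. This is only a sketch of the arithmetic; the function name and parameters are made up for illustration, and it assumes the loop-carried latency is the only bottleneck.

```rust
/// Upper bound on doubles added per core cycle when every accumulator
/// forms its own loop-carried dependency chain.
/// (Hypothetical helper: chains = accumulator registers,
/// lanes = doubles per register, latency_cycles = latency of one add.)
fn latency_bound_doubles_per_cycle(chains: u32, lanes: u32, latency_cycles: u32) -> f64 {
    f64::from(chains * lanes) / f64::from(latency_cycles)
}
```

With the numbers from the loop (4 ymm accumulators, 4 doubles per ymm register, 4 cycles of latency), `latency_bound_doubles_per_cycle(4, 4, 4)` gives 4.0.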
My measurement shows it does 5 additions per cycle. (The measurement uses the RDTSC instruction; I am not sure whether that could be a problem.)
I mostly want to understand what is going on and test how well I understand the CPU pipeline.
I think you’re observing 5 additions per cycle because you’re using RDTSC to measure.
For the last decade or so, the RDTSC instruction has not counted core clock cycles. Instead, it ticks at a constant rate tied to the CPU’s base frequency, so it effectively measures wall-clock time.
Your CPU has a base frequency of 3.6 GHz and a max turbo frequency of 5.0 GHz. If you run a short test, your CPU will likely run at turbo frequency, while the counter read by RDTSC still ticks at the base frequency.
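To see what this means for your numbers, here is a rough conversion from RDTSC ticks to core cycles. It is a sketch under two assumptions I cannot verify from here: that the counter ticks at exactly 3.6 GHz and that the core ran at 5.0 GHz for the whole benchmark. The function names are made up for illustration.

```rust
/// Converts RDTSC ticks to core clock cycles, assuming the TSC ticks at
/// `tsc_ghz` while the core runs at a constant `core_ghz`.
fn tsc_ticks_to_core_cycles(tsc_ticks: f64, tsc_ghz: f64, core_ghz: f64) -> f64 {
    tsc_ticks * core_ghz / tsc_ghz
}

/// Doubles processed per core cycle for a run measured in RDTSC ticks.
fn doubles_per_core_cycle(doubles: f64, tsc_ticks: f64, tsc_ghz: f64, core_ghz: f64) -> f64 {
    doubles / tsc_ticks_to_core_cycles(tsc_ticks, tsc_ghz, core_ghz)
}
```

Under those assumptions, 200 000 000 ticks are about 278 000 000 core cycles, so `doubles_per_core_cycle(1e9, 200e6, 3.6, 5.0)` comes out at about 3.6 additions per core cycle, which is below the 4 per cycle limit from the latency argument. If the actual clock was lower than 5.0 GHz, the real figure would be closer to 5, so it is worth checking the frequency during the run.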