Understanding throughput of a SIMD sum implementation on x86

I have the following loop in asm:

.LBB5_5:
 vaddpd  ymm0, ymm0, ymmword ptr [rdi + 8*rcx]
 vaddpd  ymm1, ymm1, ymmword ptr [rdi + 8*rcx + 32]
 vaddpd  ymm2, ymm2, ymmword ptr [rdi + 8*rcx + 64]
 vaddpd  ymm3, ymm3, ymmword ptr [rdi + 8*rcx + 96]
 add     rcx, 16
 cmp     rax, rcx
 jne     .LBB5_5

This is part of a bigger function which calculates the sum of an [f64] array in Rust.

I benchmarked this code with the criterion crate and found that 1 000 000 000 doubles take 200 000 000 cycles on my Rocket Lake CPU (i7-11700K).

In various sources I find that the latency of a floating-point addition is 4 cycles on this CPU. This would mean that each of the four vaddpd instructions can only issue every 4th cycle, because each one depends on the previous sum in its own accumulator register. With four accumulators holding four doubles each, that is 16 additions every 4 cycles, so I should be able to do at most 4 double additions per cycle.

My measurement shows it does 5 additions per cycle. (The benchmark uses the RDTSC instruction for the measurement; I am not sure whether this is problematic.)
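The latency bound reasoned out above can be written as a small calculation. This is only a sketch: the function name `max_adds_per_cycle` is made up for illustration, and it assumes the loop is limited purely by the add latency of each dependency chain, not by load bandwidth or execution ports.

```rust
/// Upper bound on f64 additions per core cycle for a reduction that is
/// limited only by the latency of its dependency chains.
///
/// `accumulators`   - number of independent vector accumulators (ymm0..ymm3 -> 4)
/// `lanes`          - f64 elements per vector register (256-bit ymm -> 4)
/// `latency_cycles` - latency of one vector add in core cycles (assumed 4 here)
fn max_adds_per_cycle(accumulators: u32, lanes: u32, latency_cycles: u32) -> f64 {
    // Each accumulator can start a new add only once its previous add has
    // finished, so all chains together retire `accumulators * lanes`
    // element additions every `latency_cycles` cycles.
    f64::from(accumulators * lanes) / f64::from(latency_cycles)
}
```

For the loop in question, `max_adds_per_cycle(4, 4, 4)` evaluates to 4.0, the bound the measurement appears to exceed.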

I mostly want to understand what is going on and test how well I understand the CPU pipeline.

Solution

I think you’re observing 5 additions per cycle because you’re using RDTSC to measure.

For the last decade or so, the RDTSC instruction has not counted CPU core cycles. Instead, it measures wall-clock time, ticking at the base frequency of the CPU.

Your CPU has a base frequency of 3.6 GHz and a max turbo frequency of 5.0 GHz. If you run a short test, the CPU will run at the turbo frequency, but the counter read by RDTSC still ticks at the base frequency, so one RDTSC tick corresponds to more than one actual core cycle.
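To see what this does to the numbers in the question, here is a rough conversion sketch. The function names are hypothetical, and it assumes the TSC ticks at exactly 3.6 GHz and that the core stayed at 5.0 GHz for the whole benchmark, which may not hold in practice.

```rust
/// Convert a duration measured in TSC ticks into approximate core cycles.
///
/// Assumes the TSC ticks at a constant `tsc_ghz` and the core ran at a
/// constant `core_ghz` for the whole measurement (an idealization).
fn tsc_ticks_to_core_cycles(tsc_ticks: f64, tsc_ghz: f64, core_ghz: f64) -> f64 {
    // Same wall-clock time, expressed in a faster clock.
    tsc_ticks * core_ghz / tsc_ghz
}

/// Element additions per actual core cycle, given a TSC-based measurement.
fn adds_per_core_cycle(elements: f64, tsc_ticks: f64, tsc_ghz: f64, core_ghz: f64) -> f64 {
    elements / tsc_ticks_to_core_cycles(tsc_ticks, tsc_ghz, core_ghz)
}
```

With the figures from the question, `adds_per_core_cycle(1e9, 2e8, 3.6, 5.0)` comes out at about 3.6 additions per core cycle, which is below the latency bound of 4 rather than above it.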