Tags: swift, performance, profiling, benchmarking, disassembly

Why does this code speed up when function call overhead is added to the loop?


I have the following Swift code that takes an iteration count and another argument and performs some computation in a loop.

@inline(never)
func rawComputeAll(iterCount: Int, timeStep: Double) -> Double {
    // most things up here are constants, scroll down to the loop
    let rho = 995.59
    let c = 0.04017672438703612,
        Rt = 0.0042879141883543715
    let s = 4184.0
    let T_in = 300.0,
        S_in = 1000.0
    var h_st = 1.0e5,
        f_st = 0.5,
        m = c * rho,
        sv = s
    var T_out = 300.0,
        h_out = 1.0e5
    for _ in 0..<iterCount {
        T_out = T_in - Rt * S_in
        let T_st = 273.15 + h_st / sv
        let deltaT = T_out - T_st
        let sign = deltaT != 0.0 ? deltaT / abs(deltaT) : deltaT
        let deltaS = sign * S_in * timeStep
        let deltaU = deltaS * 1.0
        let m_in = f_st * timeStep
        let E_in = m_in * h_out
        let m_total = m + m_in
        let E_state = m * h_out + E_in + deltaU
        h_st = E_state / m_total
    }
    return h_st
}

I benchmark it in two ways.

  1. Timing the function directly with some large iterCount like so.
let startTime1 = DispatchTime.now()
let result1 = rawComputeAll(iterCount: iterCount, timeStep: timeStep)
let endTime1 = DispatchTime.now()

(I actually used a proper benchmarking framework that reported the distribution of runtimes, but I'm using this simple timing code here for brevity and ease of reproducibility.)

  2. By moving the loop outside and passing an iteration count of one to the function. I also do a little computation on the timeStep so that the Swift compiler does not optimize the loop away.
let scaling = 1.0e-6
var result2 = Double.random(in: 0.1...10.0)
let startTime2 = DispatchTime.now()
for _ in 1...iterCount {
    result2 = rawComputeAll(iterCount: 1, timeStep: timeStep + scaling*result2)
}
let endTime2 = DispatchTime.now()
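
To compare the two apples to apples, I reduce the elapsed time to a per-iteration cost, roughly like this (just a sketch for illustration; this helper is not the benchmarking framework I actually used):

// Sketch: nanoseconds per loop iteration from a pair of DispatchTime stamps.
func nsPerIteration(_ start: DispatchTime, _ end: DispatchTime, _ iterations: Int) -> Double {
    return Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / Double(iterations)
}
print("method 1:", nsPerIteration(startTime1, endTime1, iterCount), "ns/iter, result", result1)
print("method 2:", nsPerIteration(startTime2, endTime2, iterCount), "ns/iter, result", result2)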

Now, I was expecting the second method to be slower because of the overhead of function calls in the loop (note the @inline(never) used to ensure this), and also because it does a couple of extra FLOPs in the loop body.

But it turns out that this method with the loop on the outside clocks significantly faster than the straightforward benchmark -- on my machine there is a 30% difference between them. Here is a Godbolt link showcasing the problem.

So, I have two questions.

  1. Why is the second benchmark significantly faster even though it should in principle be doing slightly more work?
  2. As of now I only have speculations and profiling did not reveal anything interesting. How can I diagnose this performance pathology and discover the root cause?

I did all measurements (profiling and benchmarking) on release builds.


Solution

  • Is for _ in 0..<iterCount just there to repeat the same work for benchmarking, so that a compiler could at least in theory optimize it down to if iterCount > 0 { }? (Spoiler alert: no, the two programs are doing different work.)

    Does the run time scale anywhere near linearly with the iteration count for high counts? If not, the compiler may be optimizing away the loop. I'd have expected an optimizing compiler to inline the function at least in the second version if this was just an apples-to-apples throughput benchmark, but your Godbolt link confirms that's not happening.

    Copy/pasting the loop (from .LBB1_4: down to dec rdi / jne .LBB1_4) out of output.rawComputeAll into https://uica.uops.info/ and having it analyze for bottlenecks shows a loop-carried dependency bottleneck of 65 cycles on Skylake.

    According to the dependency analysis, it's through XMM11, which starts out initialized to 100000.00 aka 1e5 (from movsd xmm11, qword ptr [rip + .LCPI1_0]; the Godbolt mouseover for the .quad 0x40f86a0000000000 bit-pattern shows the value). So it's h_st. And yes, we can see in your code that variables are computed from h_st and then, at the end of the loop, the result is assigned back to h_st.

    So your code has a loop-carried dependency chain through h_st.

    So that's why the compiler wasn't able to optimize away the looping.

    • Passing a high iteration count measures latency of the calculation in the loop
    • Calling repeatedly with iterCount=1 measures throughput, where out-of-order exec can overlap work across iterations.

    Throughput and latency are separate things on out-of-order exec CPUs.
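
    As a stand-alone Swift illustration of the distinction (a sketch of the idea, not code from the question): the first loop below is latency-bound because every add needs the previous sum, while the second gives the CPU four independent dependency chains that out-of-order exec can overlap.

        // Latency-bound: one serial dependency chain through `acc`;
        // each add has to wait for the previous add's result.
        func serialSum(_ xs: [Double]) -> Double {
            var acc = 0.0
            for x in xs { acc += x }
            return acc
        }

        // Throughput-friendly: four independent chains the CPU can run in parallel.
        // (The sums can differ in the last bits since FP addition isn't associative.)
        func interleavedSum(_ xs: [Double]) -> Double {
            var a = 0.0, b = 0.0, c = 0.0, d = 0.0
            var i = 0
            while i + 4 <= xs.count {
                a += xs[i]; b += xs[i + 1]; c += xs[i + 2]; d += xs[i + 3]
                i += 4
            }
            while i < xs.count { a += xs[i]; i += 1 }   // leftover elements
            return (a + b) + (c + d)
        }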

    For your loop, if we break the dependency by editing the asm loop to add xorps xmm11, xmm11 right before the div xmm11, xmm4, then uiCA predicts it would run in about 11 cycles per iteration on Skylake instead of 66.5, with a bottleneck on divider throughput. But with all the overhead of a function call and reloading the constants every time, that probably takes up too much space in the Reorder Buffer (ROB) for out-of-order exec to see far enough ahead to find the instruction-level parallelism across iterations (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/). Or perhaps we're getting into a front-end bottleneck. (At 2 loads per clock on modern CPUs when they hit in cache, reloading the constants is no big deal.)

    Benchmarking on Godbolt itself is also going to be very noisy due to other code running on the same CPUs of the AWS instance, and will suffer an unknown amount of distortion from the CPU frequency ramping up to max turbo if it wasn't there already. Running on your own desktop, CPU warm-up might be an even bigger factor if you don't run a warm-up loop, unless you run the benchmark many times in a row from a shell-script loop or something (not manually; any gap wouldn't be continuous load). See Idiomatic way of performance evaluation?
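
    A minimal warm-up pattern (a sketch; the variable names are mine, reusing rawComputeAll and iterCount from above) is to run the workload once untimed before starting the clock:

        // Untimed warm-up run so the CPU can ramp up to its steady-state clock
        // and the code/data are hot in cache before we start measuring.
        _ = rawComputeAll(iterCount: iterCount, timeStep: timeStep)

        let warmStart = DispatchTime.now()
        let warmResult = rawComputeAll(iterCount: iterCount, timeStep: timeStep)
        let warmEnd = DispatchTime.now()
        print("warmed-up:", warmEnd.uptimeNanoseconds - warmStart.uptimeNanoseconds, "ns, result", warmResult)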


    Optimizing the division bottleneck

    Division has bad throughput, and even worse latency. (On modern CPUs it's partially pipelined; on CPUs like Core 2, division wasn't pipelined at all.)
    See Floating point division vs floating point multiplication.

        let T_st = 273.15 + h_st / sv
    

    This could be written as a multiplication by a constant (or loop-invariant) reciprocal, like

        let T_st = 273.15 + h_st * (1.0 / sv)
    

    Or simply define sv_inverse = 1.0/s and use that. Compilers can't do this optimization for you (without -ffast-math or equivalent), except when the divisor has an inverse that's exactly representable in binary floating-point (e.g. when it's a power of 2: 4.0 and 0.25 are both exactly representable).

    FP multiply or FMA have about 4 cycle latency, vs. about 15 for double-precision division on recent Intel (https://uops.info/). If you can't avoid a loop-carried dependency chain, at least try to keep division and square root out of it.
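
    Applied to the function from the question, that rewrite could look something like the sketch below (the function and variable names here are mine, and the result can differ from the original in the last bits, which is exactly why the compiler won't do this for you):

        @inline(never)
        func rawComputeAllRecip(iterCount: Int, timeStep: Double) -> Double {
            let rho = 995.59
            let c = 0.04017672438703612,
                Rt = 0.0042879141883543715
            let s = 4184.0
            let T_in = 300.0,
                S_in = 1000.0
            var h_st = 1.0e5
            let f_st = 0.5,
                m = c * rho,
                sv = s
            let svInverse = 1.0 / sv          // reciprocal hoisted out of the loop
            let h_out = 1.0e5
            var T_out = 300.0
            for _ in 0..<iterCount {
                T_out = T_in - Rt * S_in
                let T_st = 273.15 + h_st * svInverse   // multiply instead of divide
                let deltaT = T_out - T_st
                let sign = deltaT != 0.0 ? deltaT / abs(deltaT) : deltaT
                let deltaS = sign * S_in * timeStep
                let deltaU = deltaS * 1.0
                let m_in = f_st * timeStep
                let E_in = m_in * h_out
                let m_total = m + m_in
                let E_state = m * h_out + E_in + deltaU
                h_st = E_state / m_total       // still a divide on the h_st chain;
                                               // m_total is loop-invariant, so its
                                               // reciprocal could be hoisted the same way
            }
            return h_st
        }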