Tags: javascript, performance, google-chrome-devtools, x86-64, v8

Unexpected de-optimisation in Chrome on high-iteration count


I was comparing the performance of two functions in Chrome's console when I stumbled across something I can't explain.

const A = 3,
    B = 2,
    C = 4,
    D = 2;

const mathCompare = (a1, a2, b1, b2) => {
    return Math.abs(Math.log(a1/a2)) < Math.abs(Math.log(b1/b2));
};

const logicCompare = (a1, a2, b1, b2) => {
    return (a1 > a2 ? a1/a2 : a2/a1) < (b1 > b2 ? b1/b2 : b2/b1);
};
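
(For positive inputs, the two functions implement the same comparison: |log(x/y)| = log(max(x,y)/min(x,y)) and log is strictly increasing, so |log(a1/a2)| < |log(b1/b2)| holds exactly when max(a1,a2)/min(a1,a2) < max(b1,b2)/min(b1,b2), which is what logicCompare evaluates directly without calling Math.log.)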

const runLooped = (j) => {
    for(let i = 0; i < 4; i++) {
        jsCompare(mathCompare, logicCompare, j);
    }
}

const jsCompare = (f1, f2, iterations) => {
    let a = jsPerf(f1, iterations);
    let b = jsPerf(f2, iterations);
    console.warn( iterations + " iterations:\n" + f1.name + ": " + a + " ms\n" + f2.name + ": " + b + " ms\n" + "delta: " + (a-b) + " ms");
}

const jsPerf = (f, iterations) => {
    let start = performance.now();

    for(let i = 0; i < iterations; i++) {
        f(A, B, C, D);
    }
    return performance.now() - start;
}

runLooped(10000000);
Running 4 sets of 10M iterations each:

  • logicCompare(): all sets take ~170ms
  • mathCompare(): first set takes ~14ms, the following sets take ~600ms.

Since performance only changed after a complete set of iterations (one call of jsCompare()), I decided to try again with a less decomposed structure:

const A = 3,
    B = 2,
    C = 4,
    D = 2;

const mathCompare = (a1, a2, b1, b2) => {
    return Math.abs(Math.log(a1/a2)) < Math.abs(Math.log(b1/b2));
};

const logicCompare = (a1, a2, b1, b2) => {
    return (a1 > a2 ? a1/a2 : a2/a1) < (b1 > b2 ? b1/b2 : b2/b1);
};

const compareRaw = (f1, f2, maxI, maxJ) => {
    for(let i = 0; i < maxI; i++) {
        let j,
            a = performance.now();
        for(j = 0; j < maxJ; j++) {
            f1(A, B, C, D);
        }
        let b = performance.now();
        for(j = 0; j < maxJ; j++) {
            f2(A, B, C, D);
        }
        let c = performance.now();
        console.warn( j + " iterations:\n" + f1.name + ": " + (b-a) + " ms\n" + f2.name + ": " + (c-b) + " ms\n" + "delta: " + ((b-a) - (c-b)) + " ms");
    }
};

const runRaw = (i) => {
    compareRaw(mathCompare, logicCompare, 4, i);
};

runRaw(10000000);

A completely different outcome. Results stabilize around set 3, after some fluctuations.

  1. Chrome 79.0.3945.130:

    • logicCompare(): all sets take ~12ms
    • mathCompare(): all sets take ~10ms
  2. Vivaldi 2.6.1566.49 (V8 7.5.288.30):

    • logicCompare(): all sets take ~60ms
    • mathCompare(): all sets take ~10ms

I was intrigued and tried everything again, this time with random numbers. The project I was testing this for would obviously never call these functions n * 10M times with the same parameters.

const mathCompare = (a1, a2, b1, b2) => {
    return Math.abs(Math.log(a1/a2)) < Math.abs(Math.log(b1/b2));
};

const logicCompare = (a1, a2, b1, b2) => {
    return (a1 > a2 ? a1/a2 : a2/a1) < (b1 > b2 ? b1/b2 : b2/b1);
};

const compareRawRandom = (f1, f2, maxI, maxJ) => {
    let randoms = [...Array(maxJ + 3)].map(()=>Math.floor(Math.random()*10));
    for(let i = 0; i < maxI; i++) {
        let j,
            a = performance.now();
        for(j = 0; j < maxJ; j++) {
            f1(randoms[j], randoms[j + 1], randoms[j + 2], randoms[j + 3]);
        }
        let b = performance.now();
        for(j = 0; j < maxJ; j++) {
            f2(randoms[j], randoms[j + 1], randoms[j + 2], randoms[j + 3]);
        }
        let c = performance.now();
        console.warn( j + " iterations:\n" + f1.name + ": " + (b-a) + " ms\n" + f2.name + ": " + (c-b) + " ms\n" + "delta: " + ((b-a) - (c-b)) + " ms");
    }
}

const runRawRandom = (i) => {
    compareRawRandom(mathCompare, logicCompare, 4, i);
};

const jsCompareRandom = (f1, f2, iterations) => {
    let randoms = [...Array(iterations + 3)].map(()=>Math.floor(Math.random()*10));
    let a = jsPerfRandom(f1, iterations, randoms);
    let b = jsPerfRandom(f2, iterations, randoms);
    console.warn( iterations + " iterations:\n" + f1.name + ": " + a + " ms\n" + f2.name + ": " + b + " ms\n" + "delta: " + (a-b) + " ms");
}

const jsPerfRandom = (f, iterations, randoms) => { 
    let start = performance.now();

    for(let i = 0; i < iterations; i++) {
        f(randoms[i], randoms[i + 1], randoms[i + 2], randoms[i + 3]);
    }
    return performance.now() - start;
}

const runRandomLooped = (j) => {
    for(let i = 0; i < 4; i++) {
        jsCompareRandom(mathCompare, logicCompare, j);
    }
}

runRandomLooped(10000000);
runRawRandom(10000000);

runRandomLooped() shows the same oddly fast first set of 10M for mathCompare():

  • logicCompare(): all sets take ~280ms
  • mathCompare(): first set ~27ms, following sets ~800ms

runRawRandom(), the less-decomposed version doing exactly the same calculations, again stabilizes after 2 sets of 10M. This time, however, both functions show identical performance of ~23 ms per 10M calls.

This only shows up in Chrome/Chromium-based browsers. Tested on:

  • Chrome 79.0.3945.130
  • Vivaldi 2.6.1566.49 (uses V8 7.5.288.30)

I also tested on Firefox 72.0.2, which showed constant performance across sets and both ways of looping:

  • logicCompare(): ~110ms
  • mathCompare(): ~35ms

I'm running an AMD FX-8350 on current Win10.

I guess it's something to do with how V8 optimizes during runtime, but I would not have expected performance to go down in that case.


Solution

  • V8 developer here. As wOxxOm points out, this is mostly an illustration of the pitfalls of microbenchmarking.

    First off:

    Unexpected de-optimisation...

    Nope, deoptimization is not the issue here (it's a very specific term with a very specific meaning). You mean "slowdown".

    ...on high-iteration count

    Nope, high iteration count is also not the issue here. While you could say that the lack of warmup in your test is contributing to the results you see, that contribution is the part you found less surprising.


    One mechanism to be aware of is "on-stack replacement": functions with (very) long-running loops will get optimized while the loop is executing. It doesn't make sense to do that on a background thread, so execution is interrupted while optimization happens on the main thread. If the code after the loop hasn't been executed yet and therefore doesn't have type feedback, then the optimized code will be thrown away ("deoptimized") once execution reaches the end of the loop, to collect type feedback while executing unoptimized bytecode. In case of another long-running loop like in the example here, the same OSR-then-deopt dance will be repeated. That means some nontrivial fraction of what you're measuring is optimization time. This explains most of the variance you see in runRawRandom before times stabilize.
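
    A hedged sketch (not code from the question or the answer) of one way to keep most of that OSR/optimization time out of the measured numbers: add an untimed warm-up pass to a jsPerf-style harness, so V8 has already collected type feedback and optimized before the timed loop starts. The warm-up count below is arbitrary, and this reduces rather than eliminates the effect.

    const jsPerfWarmed = (f, iterations) => {
        // Untimed warm-up: let V8 gather type feedback and optimize
        // (possibly via OSR of this loop) before measurement begins.
        for (let i = 0; i < 100000; i++) {
            f(A, B, C, D);
        }
        // Timed run: most optimization work should already have happened,
        // though another OSR-then-deopt cycle can still occur here.
        let start = performance.now();
        for (let i = 0; i < iterations; i++) {
            f(A, B, C, D);
        }
        return performance.now() - start;
    };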

    Another effect to be aware of is inlining. The smaller and faster the function in question, the bigger the call overhead, which is avoided when you write the benchmark such that the function can get inlined. Additionally, inlining often unlocks additional optimization opportunities: in this case, the compiler can see after inlining that the comparison result is never used, so it dead-code-eliminates all the comparisons. This explains why runRandomLooped is so much slower than runRawRandom: the latter benchmarks empty loops. The former's very first iteration is "fast" (=empty) because V8 at that point inlines mathCompare for the f(...) call in jsPerfRandom (because that's the only function it's ever seen there), but soon after it realizes "whoops, various different functions are getting called here", so it deopts and won't try to inline again on subsequent optimization attempts.
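
    A hedged variation on the question's jsPerf (not something the original benchmark did) that would defeat that dead-code elimination: actually consume the comparison's result, for example by counting how often it is true and printing the count.

    const jsPerfSink = (f, iterations) => {
        let hits = 0; // consuming the result keeps the comparison observable
        let start = performance.now();
        for (let i = 0; i < iterations; i++) {
            if (f(A, B, C, D)) hits++;
        }
        let elapsed = performance.now() - start;
        console.log(hits + " true results"); // print the counter so it can't be eliminated either
        return elapsed;
    };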

    If you care about the details, you can use some combination of the flags --trace-opt --trace-deopt --trace-osr --trace-turbo-inlining --print-opt-code --code-comments to investigate the behavior in depth. Be warned though that while this exercise is likely going to cost you significant time, what you can learn from the behavior of a microbenchmark is, in all likelihood, not going to be relevant for real-world use cases.
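
    For instance, in Node.js, which accepts V8 flags on its command line (the file name is hypothetical, and some flags such as --print-opt-code may require a build with disassembler support):

    node --trace-opt --trace-deopt --trace-osr benchmark.js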

    To illustrate:

    • you have one microbenchmark here that proves beyond doubt that mathCompare is much slower than logicCompare
    • you have another microbenchmark that proves beyond doubt that both have the same performance
    • your aggregate observations prove beyond doubt that performance goes down when V8 decides to optimize things

    In practice though, all three observations are false (which is not overly surprising given that two of them are direct contradictions of each other):

    • the "fast results" are not reflective of real-world behavior because they could dead-code eliminate the workload you were trying to measure
    • the "slow results" are not reflective of real-world behavior because the specific way the benchmark was written prevented inlining of a tiny function (which would virtually always get inlined in real code)
    • the supposed "performance going down" was just a microbenchmarking artifact, and was not at all due to failed/useless optimizations, and would not occur in a real-world situation.