performance optimization benchmarking micro-optimization microbenchmark

How do you reason about fluctuations in benchmarking data?

Suppose you're trying to optimize a function and using some benchmarking framework (like Google Benchmark) for measurement. You run the benchmarks on the original function 3 times and see average wall clock time/CPU times of 100 ms, 110 ms, 90 ms. Then you run the benchmarks on the "optimized" function 3 times and see 80 ms, 95 ms, 105 ms. (I made these numbers up). Do you conclude that your optimizations were successful?

Another problem I often run into is that I'll go do something else and run the benchmarks later in the day and get numbers that are further away than the delta between the original and optimized earlier in the day (say, 80 ms, 85 ms, 75 ms for the original function).

I know there are statistical methods to determine whether the improvement is "significant". Do software engineers actually use these formal calculations in practice?

I'm looking for some kind of process to follow when optimizing code.

Solution

Rule of Thumb

Minimum(!) of each series => 90ms vs 80ms
Estimate noise => ~ 10ms
Pessimism => It probably didn't get any slower.
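
Applied to the numbers from the question, the three steps could look roughly like the sketch below. The helper names `noise_estimate` and `compare` are made up for illustration, and using the median absolute deviation as the "glance" at the noise is an assumption; any crude spread estimate would do.

```python
from statistics import median


def noise_estimate(samples):
    """Crude noise estimate: median absolute deviation from the median."""
    mid = median(samples)
    return median(abs(x - mid) for x in samples)


def compare(baseline, candidate):
    """Compare the minimum of each series against the estimated noise."""
    best_base = min(baseline)
    best_cand = min(candidate)
    noise = max(noise_estimate(baseline), noise_estimate(candidate))
    gain = best_base - best_cand
    # Pessimism: only claim a change if it clearly exceeds the noise.
    if gain > noise:
        verdict = "probably faster"
    elif gain < -noise:
        verdict = "probably slower"
    else:
        verdict = "inconclusive"
    return best_base, best_cand, noise, verdict


original = [100, 110, 90]   # ms, from the question
optimized = [80, 95, 105]   # ms, from the question
print(compare(original, optimized))
```

With these six numbers the gain (10 ms) does not exceed the noise (about 10 ms), so the honest reading is that the result is inconclusive.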

Not happy yet?

Take more measurements. (~13 runs each)
Interleave the runs. (Don't measure 13x A followed by 13x B.)

Ideally you would randomize whether A or B runs next (in scientific terms, a randomized trial), but that is probably overkill. The point is that any source of error should affect each variant with the same probability, such as the CPU heating up over time or a background task starting after run 11.
Go back to step 1.
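
A minimal sketch of interleaved, randomized runs, assuming each variant can be wrapped in a zero-argument callable. The functions `variant_a` and `variant_b` are placeholders for your own code, and `interleaved_benchmark` is a hypothetical helper, not part of any framework.

```python
import random
import time


def time_once(func):
    """Return the wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def interleaved_benchmark(variants, runs=13, seed=None):
    """Run every variant `runs` times in a shuffled order.

    variants: dict mapping a name to a zero-argument callable.
    Returns a dict mapping each name to its list of timings.
    """
    rng = random.Random(seed)
    schedule = [name for name in variants for _ in range(runs)]
    rng.shuffle(schedule)  # so drift and background tasks hit both variants alike
    timings = {name: [] for name in variants}
    for name in schedule:
        timings[name].append(time_once(variants[name]))
    return timings


def variant_a():  # placeholder workload
    return sum(i * i for i in range(20_000))


def variant_b():  # placeholder workload
    return sum([i * i for i in range(20_000)])


timings = interleaved_benchmark({"A": variant_a, "B": variant_b}, runs=13, seed=0)
for name, samples in timings.items():
    print(f"{name}: min {min(samples) * 1e3:.3f} ms over {len(samples)} runs")
```

The resulting series can then be fed back into step 1: take the minimum of each and compare the difference to the noise.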

Still not happy? Time to admit that you've been nerd-sniped. The difference, if it exists, is so small that you can't even measure it. Pick the more readable variant and move on. (Or alternatively, lock your CPU frequency, isolate a core just for the test, quiet down your system...)

Explanation

Minimum: Many people (and even some tools) take the average, but the minimum is statistically more stable. There is a lower limit to how fast your benchmark can run on given hardware, but no upper limit to how much it can be slowed down by other programs. Taking the minimum also automatically drops the initial "warm-up" run.
Noise: Apply common sense and just glance over the numbers. If you look at the standard deviation, be very skeptical of it! A single outlier influences it so much that it becomes nearly useless. (The distribution is usually not normal.)
Pessimism: You were really clever to find this optimization, you really want the optimized version to be faster! If it looks better just by chance, you will believe it. (You knew it!) So if you care about being correct, you must counter this tendency.
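
The claims about the minimum and the standard deviation can be checked on a small series. The timings below are made up for illustration (they are not measurements), and `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev

# Made-up timings in ms: a quiet series, then the same series
# with one run disturbed by a background task.
clean = [90, 92, 91, 93, 90, 91]
with_outlier = clean + [250]


def summarize(samples):
    """Return the minimum, mean and sample standard deviation."""
    return {"min": min(samples), "mean": mean(samples), "stdev": stdev(samples)}


for label, series in (("clean", clean), ("with outlier", with_outlier)):
    s = summarize(series)
    print(f"{label:>12}: min={s['min']} mean={s['mean']:.1f} stdev={s['stdev']:.1f}")
```

The single slow run leaves the minimum untouched, shifts the mean noticeably, and inflates the standard deviation by more than an order of magnitude.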

Disclaimer

These are just basic guidelines. Worst-case latency is relevant in some applications (smooth animations or motor control), but it is harder to measure. It's easy (and fun!) to optimize something that doesn't matter in practice. Instead of wondering whether your 1% gain is statistically significant, try something else. Measure the full program, including OS overhead. Comment out code, or run the work twice, just to check whether optimizing it might be worth it.
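
The "run work twice" check boils down to simple arithmetic, sketched below. The helper `potential_gain` and the numbers are hypothetical. Treat the result as a rough estimate, because the second execution may benefit from warm caches.

```python
def potential_gain(total_once_ms, total_twice_ms):
    """Estimate the cost of one step from two whole-program timings.

    total_once_ms:  best time of the normal program
    total_twice_ms: best time with the suspect step executed twice
    Returns (step_cost_ms, fraction_of_total): roughly the most you
    could save if the step took no time at all.
    """
    step_cost = total_twice_ms - total_once_ms
    return step_cost, step_cost / total_once_ms


# Hypothetical numbers, not measurements.
cost_ms, fraction = potential_gain(total_once_ms=200.0, total_twice_ms=206.0)
print(f"step costs about {cost_ms:.1f} ms, at most {fraction:.0%} of the run")
```

If that fraction is smaller than your measurement noise, optimizing the step is unlikely to produce a difference you can observe.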