
How to get more meaningful statistics from multiple runs of a haskell benchmark


I'm running some fairly simple benchmarks with the benchpress library. I've been using the bench :: Int -> IO a -> IO () interface. However, it seems like if I run a given function n times, all runs after the first are extremely fast.

As a simple example, bench 1 (seq (sum [1..100000]) (return ())) might take 10 or so milliseconds. However, bench 5 (seq (sum [1..100000]) (return ())) will produce a report like this:

Times (ms)
   min    mean    +/-sd  median    max 
  0.001   2.657   5.937   0.001  13.277

Percentiles (ms)
  50%  0.001
  66%  0.002
  75%  0.002
  80%  0.002
  90%  13.277
  95%  13.277
  98%  13.277
  99%  13.277
 100%  13.277

Since the mean is 2.657 ms while the median is only 0.001 ms, I can infer that the first run took about 13 ms and the other four were extremely fast.

Why does this happen? How can I ensure that all runs of a benchmark are representative? The library has a more fine-grained interface as well: benchmark :: Int -> IO a -> (a -> IO b) -> (a -> IO c) -> IO (Stats, Stats). This would let me provide setup and teardown functions -- can I use this interface to get more meaningful results?
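
For reference, here is roughly how I imagine calling that interface. This sketch is based only on the signature above; I'm assuming the two function arguments are a teardown hook and the timed action, in that order, and that the module is Test.BenchPress:

    import Test.BenchPress (benchmark)
    
    main = do
      -- (wall-clock stats, CPU-time stats), per the signature above
      (_wall, _cpu) <- benchmark 5
        (return ())        -- setup: nothing to prepare here
        (\_ -> return ())  -- teardown: nothing to clean up
        (\_ -> seq (sum [1..100000]) (return ()))  -- the action being timed
      return ()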


Solution

  • I'd recommend using criterion. It's been carefully designed with facilities to time pure computations (which, as you've discovered, can be tricky). I'm not familiar with benchpress, but it doesn't seem to have those same facilities out of the box, and appears to be aimed primarily at benchmarking IO operations.

    Benchmarking your example in criterion would look something like this:

    import Criterion.Main
    
    main = defaultMain
      [ bench "my summation" $ whnf sum [1..100000] ]
    

    Benchmarks run from GHCi, or compiled by ghc without optimization flags, are largely meaningless, so it's important to compile this with ghc -O2. Running it will produce output like:

    benchmarking my summation
    time                 9.393 ms   (9.271 ms .. 9.498 ms)
                         0.998 R²   (0.997 R² .. 0.999 R²)
    mean                 9.385 ms   (9.292 ms .. 9.483 ms)
    std dev              268.7 μs   (208.4 μs .. 334.0 μs)
    

    You can see here that the timings varied only from a minimum of 9.3 ms to a maximum of 9.5 ms, so there were no large outliers. That's partly because criterion automatically discards an initial run, ensuring that costs incurred only the first time code is run (a common occurrence with GHC code) don't get included in the timings.
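
    Relatedly, criterion has an analogue of the setup hook you mentioned: env runs a setup action once, outside the timed region, forces its result to normal form, and hands it to the benchmarks. A small sketch, using env to build the list up front (the benchmark name is arbitrary):

    import Criterion.Main
    
    main = defaultMain
      [ -- env constructs (and fully forces) the list before timing starts,
        -- so only the summation itself is measured
        env (return [1..100000 :: Integer]) $ \xs ->
          bench "sum a pre-built list" $ whnf sum xs ]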

    The whnf function is a magic function: even though its two arguments may be fully evaluated after the first run, and therefore fully formed in memory, it ensures that the application of its first argument to its second is genuinely repeated for each run, and that evaluation proceeds far enough to put the result in "weak head normal form". The weak head normal form of a number (like the sum of a bunch of integers) is the number itself, so for this benchmark the timing covers the evaluation of the actual numeric sum.
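
    To see the difference between weak head normal form and full evaluation, compare whnf with criterion's nf, which forces the result all the way to normal form. For a result that is a single number the two coincide, but for a result that is a list they measure very different things (a small illustration; the benchmark names are arbitrary):

    import Criterion.Main
    
    main = defaultMain
      [ -- whnf stops at the outermost (:) constructor, so almost none of
        -- the mapped list is actually computed
        bench "whnf: outermost constructor only" $ whnf (map (+1)) [1..100000 :: Int]
        -- nf forces every element, so the whole map really runs
      , bench "nf: fully evaluated result" $ nf (map (+1)) [1..100000 :: Int]
      ]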

    It's important to understand what parts of this computation aren't being benchmarked. The expression [1..100000] constructs a list. Provided the list isn't optimized away (and in this benchmark it isn't), the construction of the list, as a singly-linked list of boxed Integers held entirely in memory, is performed on the first discarded iteration, and the timing being benchmarked here is the traversal of the constructed list to sum its elements. You could time the construction and summing of the list together with:

    bench "construct and sum" $ whnf (\n -> sum [1..n]) 100000
    

    but this produces an unexpectedly faster result:

    benchmarking construct and sum
    time                 1.299 ms   (1.288 ms .. 1.314 ms)
                         0.999 R²   (0.999 R² .. 1.000 R²)
    mean                 1.290 ms   (1.285 ms .. 1.297 ms)
    std dev              20.77 μs   (14.74 μs .. 27.59 μs)
    

    because the list is optimized away by list fusion, and you are now benchmarking a tight summation loop.

    If you really want to time construction and summation of an explicit list, you can prevent the list fusion with a copy of sum that doesn't inline:

    sum' :: (Num a) => [a] -> a
    {-# NOINLINE sum' #-}
    sum' = sum
    
    ...bench "construct and sum w/o fusion" $ whnf (\n -> sum' [1..n]) 100000...
    

    All that is to say: benchmarking GHC code is tricky, so using criterion is almost mandatory.

    A full example:

    import Criterion.Main
    
    {-# NOINLINE sum' #-}
    sum' :: (Num a) => [a] -> a
    sum' = sum
    
    main = defaultMain
      [ bench "sum an in-memory list" $ whnf sum [1..100000]
      , bench "construct and sum w/ fusion" $ whnf (\n -> sum [1..n]) 100000
      , bench "construct and sum w/o fusion" $ whnf (\n -> sum' [1..n]) 100000
      , bench "Int (vs. Integer) and fusion" $ whnf (\n -> sum[(1::Int)..n]) 100000
      ]
    

    Roughly the timings I get with ghc -O2 are 9 ms, 1 ms, 14 ms, and 47 μs, respectively. Note that Ints are extremely fast compared to Integers, something to keep in mind if you aren't using explicit type signatures and are inadvertently defaulting to Integer.

    Here, the difference has less to do with the datatype itself and more to do with the combination of unboxing and fusion. That final benchmark gets compiled to a fairly tight assembly loop that adds the numbers from 1 to 100000 in a register.
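
    To get a feel for what that fused loop looks like at the source level, here's a hand-written strict accumulator loop. This is only a sketch of the shape of the code, not GHC's actual output, but with -O2 it should compile to essentially the same kind of tight loop:

    {-# LANGUAGE BangPatterns #-}
    
    -- Roughly the shape that sum [(1 :: Int)..n] fuses into: a strict
    -- counting loop with an Int accumulator, and no list anywhere.
    sumTo :: Int -> Int
    sumTo n = go 1 0
      where
        go !i !acc
          | i > n     = acc
          | otherwise = go (i + 1) (acc + i)
    
    main = print (sumTo 100000)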

    Actually, the native code generator does a bad job here. The LLVM backend (ghc -O2 -fllvm) gets the Int version down to 100 nanoseconds. When you get timings this small, it's a good idea to scale up the problem to make sure you're actually measuring what you think you're measuring. If I scale up the list length by 10x, the timings all scale up by 10x, so I can be reasonably confident that I'm timing the actual summation, as intended.
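
    A convenient way to do that scaling check in criterion is to generate one benchmark per input size (the sizes here are arbitrary):

    import Criterion.Main
    
    main = defaultMain
      [ bench ("sum Int 1.." ++ show n) $ whnf (\m -> sum [(1 :: Int)..m]) n
      | n <- [100000, 1000000, 10000000]
      ]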