Search code examples
performancehaskellstreammonadsffi

Squeezing more performance out of monadic streams in Haskell


The most straightforward monadic 'stream' is just a list of monadic actions Monad m => [m a]. The sequence :: [m a] -> m [a] function evaluates each monadic action and collects the results. As it turns out, sequence is not very efficient, though, because it operates on lists, and the monad is an obstacle to achieving fusion in anything but the simplest cases.

The question is: What is the most efficient approach for monadic streams?

To investigate this, I provide a toy problem along with a few attempts to improve performance. The source code can be found on github. The singular benchmark presented below may be misleading for more realistic problems, although I think it is a worst-case scenario of sorts, i.e. the most possible overhead per useful computation.

The toy problem

is a maximum length 16-bit Linear Feedback Shift Register (LFSR), implemented in C in a somewhat over-elaborate way, with a Haskell wrapper in the IO monad. 'Over-elaborate' refers to the unnecessary use of a struct and its malloc - the purpose of this complication is to make it more similar to realistic situations where all you have is a Haskell wrapper around a FFI to a C struct with OO-ish new, set, get, operate semantics (i.e. very much the imperative style). A typical Haskell program looks like this:

import LFSR

main = do
    lfsr <- newLFSR            -- make a LFSR object
    setLFSR lfsr 42            -- initialise it with 42 
    stepLFSR lfsr              -- do one update
    getLFSR lfsr >>= print     -- extract the new value and print

The default task is to calculate the average of the values (doubles) of 10'000'000 iterations of the LFSR. This task could be part of a suite of tests to analyse the 'randomness` of this stream of 16-bit integers.

0. Baseline

The baseline is the C implementation of the average over n iterations:

double avg(state_t* s, int n) {
    double sum = 0;
    for(int i=0; i<n; i++, sum += step_get_lfsr(s));
    return sum / (double)n;
}

The C implementation isn't meant to be particularly good, or fast. It just provides a meaningful computation.

1. Haskell lists

Compared to the C baseline, on this task Haskell lists are 73x slower.

=== RunAvg =========
Baseline: 1.874e-2
IO:       1.382488
factor:   73.77203842049093

This is the implementation (RunAvg.hs):

step1 :: LFSR -> IO Word32
step1 lfsr = stepLFSR lfsr >> getLFSR lfsr

avg :: LFSR -> Int -> IO Double
avg lfsr n = mean <$> replicateM n (step1 lfsr) where
    mean :: [Word32] -> Double
    mean vs = (sum $ fromIntegral <$> vs) / (fromIntegral n)

2. Using the streaming library

This gets us to within 9x of the baseline,

=== RunAvgStreaming ===
Baseline: 1.9391e-2
IO:       0.168126
factor:   8.670310969006241

(Note that the benchmarks are rather inaccurate at these short execution times.)

This is the implementation (RunAvgStreaming.hs):

import qualified Streaming.Prelude as S
avg :: LFSR -> Int -> IO Double
avg lfsr n = do
    let stream = S.replicateM n (fromIntegral <$> step1 lfsr :: IO Double)
    (mySum :> _) <- S.sum stream
    return (mySum / fromIntegral n)

3. Using Data.Vector.Fusion.Stream.Monadic

This gives the best performance so far, within 3x of baseline,

=== RunVector =========
Baseline: 1.9986e-2
IO:       4.9146e-2
factor:   2.4590213149204443

As usual, here is the implementation (RunAvgVector.hs):

import qualified Data.Vector.Fusion.Stream.Monadic as V
avg :: LFSR -> Int -> IO Double
avg lfsr n = do
   let stream = V.replicateM n (step1' lfsr)
   V.foldl (+) 0.0 stream

I didn't expect to find a good monadic stream implementation under Data.Vector. Other than providing fromVector and concatVectors, Data.Vector.Fusion.Stream.Monadic has very little to do with Vector from Data.Vector.

A look at the profiling report shows that Data.Vector.Fusion.Stream.Monadic has a considerable space leak, but that doesn't sound right.

4. Lists aren't necessarily slow

For very simple operations lists aren't terrible at all:

=== RunRepeat =======
Baseline: 1.8078e-2
IO:       3.6253e-2
factor:   2.0053656377917912

Here, the for loop is done in Haskell instead of pushing it down to C (RunRepeat.hs):

do
   setLFSR lfsr 42
   replicateM_ nIter (stepLFSR lfsr)
   getLFSR lfsr

This is just a repetition of calls to stepLFSR without passing the result back to the Haskell layer. It gives an indication of what impact the overhead for calling the wrapper and the FFI has.

Analysis

The repeat example above shows that most, but not all (?), of the performance penalty comes from overhead of calling the wrapper and/or the FFI. But I'm not sure where to look for tweaks, now. Maybe this is just as good as it gets with regards to monadic streams, and in fact this is all about trimming down the FFI, now...

Sidenotes

  1. The fact that LFSRs are chosen as a toy problem does not imply that Haskell isn't able to do these efficiently - see the SO question "Efficient bit-fiddling in a LFSR implementation ".
  2. Iterating a 16-bit LFSR 10M times is a rather silly thing to do. It will take at most 2^16-1 iterations to reach the starting state again. In a maximum length LFSR it will take exactly 2^16-1 iterations.

Update 1

An attempt to remove the withForeignPtr calls can be made by introducing a Storable and then using alloca :: Storable a => (Ptr a -> IO b) -> IO b

repeatSteps :: Word32 -> Int -> IO Word32
repeatSteps start n = alloca rep where
    rep :: Ptr LFSRStruct' -> IO Word32
    rep p = do
        setLFSR2 p start
        (sequence_ . (replicate n)) (stepLFSR2 p)
        getLFSR2 p

where LFSRStruct' is

data LFSRStruct' = LFSRStruct' CUInt

and the wrapper is

foreign import ccall unsafe "lfsr.h set_lfsr"
    setLFSR2 :: Ptr LFSRStruct' -> Word32 -> IO ()

-- likewise for setLFSR2, stepLFSR2, ...

See RunRepeatAlloca.hs and src/LFSR.hs. Performance-wise this makes no difference (within timing variance).

=== RunRepeatAlloca =======
Baseline: 0.19811199999999998
IO:       0.33433
factor:   1.6875807623970283

Solution

  • After deciphering GHC's assembly product for RunRepeat.hs I'm coming to this conclusion: GHC won't inline the call to the C function step_lfsr(state_t*), whereas the C compiler will, and this makes all the difference for this toy problem.

    I can demonstrate this by forbidding inlining with the __attribute__ ((noinline)) pragma. Overall, the C executable gets slower, hence the gap between Haskell and C closes.

    Here are the results:

    === RunRepeat =======
    #iter:    100000000
    Baseline: 0.334414
    IO:       0.325433
    factor:   0.9731440669349967
    === RunRepeatAlloca =======
    #iter:    100000000
    Baseline: 0.330629
    IO:       0.333735
    factor:   1.0093942152684732
    === RunRepeatLoop =====
    #iter:    100000000
    Baseline: 0.33195399999999997
    IO:       0.33791
    factor:   1.0179422450098508
    

    I.e. there is no penalty for FFI calls to lfsr_step anymore.

    === RunAvg =========
    #iter:    10000000
    Baseline: 3.4072e-2
    IO:       1.3602589999999999
    factor:   39.92307466541442
    === RunAvgStreaming ===
    #iter:    50000000
    Baseline: 0.191264
    IO:       0.666438
    factor:   3.484388070938598
    

    Good old lists don't fuse, hence the huge performance hit, and the streaming library also isn't optimal. But Data.Vector.Fusion.Stream.Monadic gets within 20% of the C performance:

    === RunVector =========
    #iter:    200000000
    Baseline: 0.705265
    IO:       0.843916
    factor:   1.196594188000255
    

    It has been observed already that GHC doesn't inline FFI calls: "How to force GHC to inline FFI calls?" .

    For situations where the benefit of inlining is so high, i.e. the workload per FFI call is so low, it might be worth looking into inline-c.