
Profiling a Haskell program


I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:

sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
  xs <- sequence (replicate n dist)
  return (sum xs)

The real code is a bit more complicated, though. The function I'm actually interested in is likelihoodWeighting at this GitHub repo.

I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.

Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.


Profiling

I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing

$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s

First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?

 1,377,538,232 bytes allocated in the heap
 1,195,050,032 bytes copied during GC
   169,411,368 bytes maximum residency (12 sample(s))
     7,360,232 bytes maximum slop
           423 MB total memory in use (0 MB lost due to fragmentation)

Generation 0:  2574 collections,     0 parallel,  2.40s,  2.43s elapsed
Generation 1:    12 collections,     0 parallel,  1.07s,  1.28s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    1.92s  (  1.94s elapsed)
GC    time    3.47s  (  3.70s elapsed)
RP    time    0.00s  (  0.00s elapsed)
PROF  time    0.23s  (  0.23s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.63s  (  5.87s elapsed)

%GC time      61.8%  (63.1% elapsed)

Alloc rate    716,368,278 bytes per MUT second

Productivity  34.2% of total user, 32.7% of total elapsed

Here are the results from

$ ./main +RTS -p

The first time I ran this, it showed one function being called repeatedly. Memoizing that function sped things up by a factor of 2, but it didn't solve the space leak.

COST CENTRE           MODULE                no. entries  %time %alloc   %time %alloc

MAIN                  MAIN                    1        0   0.0    0.0   100.0  100.0
 main                 Main                  434        4   0.0    0.0   100.0  100.0
  likelihoodWeighting AI.Probability.Bayes  445        1   0.0    0.3   100.0  100.0
   distributionLW     AI.Probability.Bayes  448        1   0.0    2.6     0.0    2.6
   getSampleLW        AI.Probability.Bayes  446   100000  20.0   50.4   100.0   97.1
    bnProb            AI.Probability.Bayes  458   400000   0.0    0.0     0.0    0.0
    bnCond            AI.Probability.Bayes  457   400000   6.7    0.8     6.7    0.8
    bnVals            AI.Probability.Bayes  455   400000  20.0    6.3    26.7    7.1
     bnParents        AI.Probability.Bayes  456   400000   6.7    0.8     6.7    0.8
    bnSubRef          AI.Probability.Bayes  454   800000  13.3   13.5    13.3   13.5
    weightedSample    AI.Probability.Bayes  447   100000  26.7   23.9    33.3   25.3
     bnProb           AI.Probability.Bayes  453   100000   0.0    0.0     0.0    0.0
     bnCond           AI.Probability.Bayes  452   100000   0.0    0.2     0.0    0.2
     bnVals           AI.Probability.Bayes  450   100000   0.0    0.3     6.7    0.5
      bnParents       AI.Probability.Bayes  451   100000   6.7    0.2     6.7    0.2
     bnSubRef         AI.Probability.Bayes  449   200000   0.0    0.7     0.0    0.7

Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.

[Image: heap profile]

Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?


Solution

  • A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here and brought GC time down to a negligible level, or nearly so.
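For readers who haven't seen the pattern, here is a minimal sketch of the foldM idea applied to the simplified sampleMean from the question, not to the actual likelihoodWeighting code. The name sampleSumStrict is made up for illustration, and the monad is left generic (rather than MonadRandom) so that the snippet needs only base.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad (foldM)

-- Hypothetical strict counterpart of the question's sketch: it keeps a
-- running total instead of collecting all n samples in a list first.
sampleSumStrict :: Monad m => Int -> m Float -> m Float
sampleSumStrict n dist = foldM step 0 [1 .. n]
  where
    -- The bang forces the accumulator each time step is entered,
    -- so at most one pending addition exists at any point.
    step !acc _ = do
      x <- dist
      return (acc + x)
```

Each sample is added to the running total as soon as it is drawn, so no list of n results is ever held in memory.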

    A profiling run with the current source yields (columns: cost centre, module, %time, %alloc, ticks, bytes)

    probabilityIO              AI.Util.Util          26.1   42.4    413 290400000
    weightedSample.go          AI.Probability.Bayes  16.1   19.1    255 131200080
    bnParents                  AI.Probability.Bayes  10.8    1.2    171   8000384
    bnVals                     AI.Probability.Bayes  10.4    7.8    164  53603072
    bnCond                     AI.Probability.Bayes   7.9    1.2    125   8000384
    ndSubRef                   AI.Util.Array          4.8    9.2     76  63204112
    bnSubRef                   AI.Probability.Bayes   4.7    8.1     75  55203072
    likelihoodWeighting.func   AI.Probability.Bayes   3.3    2.8     53  19195128
    %!                         AI.Util.Util           3.3    0.5     53   3200000
    bnProb                     AI.Probability.Bayes   2.5    0.0     40        16
    bnProb.p                   AI.Probability.Bayes   2.5    3.5     40  24001152
    likelihoodWeighting        AI.Probability.Bayes   2.5    2.9     39  20000264
    likelihoodWeighting.func.x AI.Probability.Bayes   2.3    0.2     37   1600000
    

    and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.

    Still, there are some points we can improve. First, a relatively minor one in the grand scheme of things, AI.Util.Array.ndSubRef:

    ndSubRef :: [Int] -> Int
    ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
    

    Reversing the list and mapping (2^) over another list is inefficient. Better is

    -- assumes: import qualified Data.List as L
    ndSubRef = L.foldl' (\a d -> 2*a + d) 0
    

    which doesn't need to keep the entire list in memory, as reversing it does (probably not a big deal, since the lists will be short), and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,

    ndSubRef                   AI.Util.Array          1.7    1.3     24   8000384
    

    in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
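To check that the foldl' version of ndSubRef computes the same number as the original, the two can be put side by side. This is only a sketch: the names ndSubRefOld and ndSubRefNew are made up here so both definitions can live in one file.

```haskell
import qualified Data.List as L

-- Original definition: reverse the digits and weight them by powers of two.
ndSubRefOld :: [Int] -> Int
ndSubRefOld ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])

-- Rewritten definition: a single strict left fold (Horner's scheme),
-- with no reversed copy and no second list of powers.
ndSubRefNew :: [Int] -> Int
ndSubRefNew = L.foldl' (\a d -> 2*a + d) 0
```

Both treat the first element as the most significant digit, so [1,0,1] gives 4 + 0 + 1 = 5 either way.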

    Let's add a bit of strictness in weightedSample to see how that changes things:

    weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
    weightedSample bn fixed =
        go 1.0 (M.fromList fixed) (bnVars bn)
        where
            go w assignment []     = return (assignment, w)
            go w assignment (v:vs) = if v `elem` vars
                then
                    let w' = w * bnProb bn assignment (v, fixed %! v)
                    in go w' assignment vs
                else do
                    let p = bnProb bn assignment (v,True)
                    x <- probabilityIO p
                    go w (M.insert v x assignment) vs
    
            vars = map fst fixed
    

    Neither the weight parameter of go nor the assignment parameter is ever forced, so both can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force the updates to take effect immediately, and also evaluate p before passing it to probabilityIO:

    go w assignment (v:vs) = if v `elem` vars
        then
            let !w' = w * bnProb bn assignment (v, fixed %! v)
            in go w' assignment vs
        else do
            let !p = bnProb bn assignment (v,True)
            x <- probabilityIO p
            let !assignment' = M.insert v x assignment
            go w assignment' vs
    

    That brings a further reduction in allocation (~9%) and a small speedup (~13%), but the total memory usage and maximum residency haven't changed much.
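Stripped of the Bayes-net details, the bang-pattern trick on the weight looks roughly like this. It is a toy sketch: weightProduct is a made-up name, and the list of Doubles stands in for the successive bnProb results.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Toy version of the weight-accumulating loop: multiply all factors
-- into a running weight.
weightProduct :: [Double] -> Double
weightProduct = go 1.0
  where
    go w []       = w
    go w (p : ps) =
      -- The bang forces the multiplication before the recursive call,
      -- so the accumulator never becomes a chain of (w * p * ...) thunks.
      let !w' = w * p
      in go w' ps
```

Without the bang, go would pass the unevaluated product along, and the whole chain would only be evaluated when the final weight is demanded.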

    I see nothing else obvious to change there, so let's look at likelihoodWeighting:

    func m _ = do
        (a, w) <- weightedSample bn fixed
        let x = a ! e
        return $! x `seq` w `seq` M.adjust (+w) x m
    

    In the last line, w is already evaluated in weightedSample now, so we don't need to seq it here, and the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust: adjust has no way of forcing the result of the updating function, so it builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but since the key at which the map is updated is guaranteed to be present, Data.Map provides a much more convenient way here, insertWith':

    func !m _ = do
        (a, w) <- weightedSample bn fixed
        let x = a ! e
        return (M.insertWith' (+) x w m)
    

    (Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:

     934,566,488 bytes allocated in the heap
       1,441,744 bytes copied during GC
          68,112 bytes maximum residency (1 sample(s))
          23,272 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
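The same accumulation pattern can be sketched in isolation. Newer versions of containers no longer ship insertWith'; as far as I know, insertWith from Data.Map.Strict fills the same role, since it evaluates the combined value before storing it. The function tally and its sample argument are made up for illustration and stand in for func and weightedSample.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad (foldM)
import qualified Data.Map.Strict as MS

-- Hypothetical stand-in for the loop in likelihoodWeighting: draw n
-- (outcome, weight) pairs and add each weight to its outcome's total.
tally :: Monad m => m (Bool, Double) -> Int -> m (MS.Map Bool Double)
tally sample n = foldM step start [1 .. n]
  where
    -- Both keys are present from the start, as in the answer's setting.
    start = MS.fromList [(False, 0), (True, 0)]
    -- The bang keeps the map itself evaluated; the strict insertWith
    -- keeps the stored totals evaluated.
    step !m _ = do
      (x, w) <- sample
      return (MS.insertWith (+) x w m)
```

With a lazy adjust in place of the strict insertWith, each total would grow into a chain of pending additions, which is the residency problem described above.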
    

    The biggest remaining improvement in running time would come from avoiding randomIO, because the StdGen it uses is very slow.
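To give an idea of what avoiding randomIO could look like, here is a sketch that threads a small hand-written generator through pure code. Everything in it is an assumption for illustration: the Gen type, the function names, and the choice of a plain xorshift64 step, which is fast but not a statistically strong generator. A dedicated random-number library would be the more sensible choice in real code.

```haskell
import Data.Bits (shiftL, shiftR, xor)
import Data.Word (Word64)

-- Hypothetical generator state; it must be seeded with a nonzero value.
newtype Gen = Gen Word64

-- One xorshift64 step (shift constants 13, 7, 17).
next :: Gen -> (Word64, Gen)
next (Gen s0) = (s3, Gen s3)
  where
    s1 = s0 `xor` (s0 `shiftL` 13)
    s2 = s1 `xor` (s1 `shiftR` 7)
    s3 = s2 `xor` (s2 `shiftL` 17)

-- Uniform Double in [0, 1), built from the top 53 bits of the output
-- (9007199254740992 is 2^53).
uniformDouble :: Gen -> (Double, Gen)
uniformDouble g = (fromIntegral (w `shiftR` 11) / 9007199254740992, g')
  where
    (w, g') = next g

-- Pure counterpart of a probabilityIO-style draw: True with probability p.
probability :: Double -> Gen -> (Bool, Gen)
probability p g = (u < p, g')
  where
    (u, g') = uniformDouble g
```

The sampling loop would then carry the Gen along as one more accumulator, next to the weight and the assignment.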

    I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.