Search code examples
performancehaskellfusion

Why doesn't the List fuse as well as the Vector? (Haskell)


Consider the following benchmark:

module Main where

import qualified Data.List as L
import qualified Data.Vector.Unboxed as U
import Criterion.Main



goodSum :: Int -> Double
{-# NOINLINE goodSum #-}
goodSum n =
  let ints = U.enumFromN 0 (n * n * 10) :: U.Vector Int
  in U.foldl' (+) 0 $ U.map fromIntegral ints

badSum :: Int -> Double
{-# NOINLINE badSum #-}
badSum n = L.foldl' (+) 0.5 [fromIntegral i | i <- [0 .. 10*n*n]]

badSum2 :: Int -> Double
{-# NOINLINE badSum2 #-}
badSum2 n = L.foldr (+) 0.5 [fromIntegral i | i <- [0 .. 10*n*n]]

worstSum :: Int -> Double
{-# NOINLINE worstSum #-}
worstSum n = L.foldl1' (+) $ do
  i <- [0 .. n*n]
  return $ L.foldl1' (+) $ do
    k <- [0 .. 10]
    return $ fromIntegral $ k + i

main = do
  defaultMain
    [ bench "good" $ nf goodSum 500
    , bench "bad" $ nf badSum 500
    , bench "bad2" $ nf badSum2 500
    , bench "worst" $ nf worstSum 500
    ]

The results:

benchmarking good
time                 1.826 ms   (1.819 ms .. 1.835 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.810 ms   (1.803 ms .. 1.817 ms)
std dev              23.18 μs   (19.91 μs .. 27.96 μs)

benchmarking bad
time                 38.38 ms   (38.07 ms .. 38.74 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.23 ms   (38.07 ms .. 38.38 ms)
std dev              298.5 μs   (220.6 μs .. 417.8 μs)

benchmarking bad2
time                 77.87 ms   (73.74 ms .. 82.67 ms)
                     0.992 R²   (0.976 R² .. 0.998 R²)
mean                 78.14 ms   (75.33 ms .. 82.13 ms)
std dev              5.184 ms   (3.056 ms .. 7.966 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking worst
time                 80.80 ms   (79.53 ms .. 82.10 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 80.73 ms   (80.29 ms .. 81.19 ms)
std dev              756.9 μs   (591.9 μs .. 938.2 μs)

List comprehensions are good producers and foldr is a good consumer, so why didn't the List fuse?


Solution

  • Contrary to your question, foldr and the list comprehension did, in fact, fuse. However, you have to recall the definition of foldr and not that it is not a tail recursive function. The prelude defines foldr as, which will not compile down to a tight loop like the Vector based example.

    foldr k z = go
      where
        go []     = z
        go (y:ys) = y `k` go ys
    

    The important bit of core generated for badSum2 looks like this

    $wgo_s8AH [Occ=LoopBreaker] :: Int# -> Double#
    $wgo_s8AH =
      \ (w_s8AD :: Int#) ->
        case tagToEnum#
               @ Bool (==# w_s8AD y1_a69V)
        of _ [Occ=Dead] {
          False ->
            case $wgo_s8AH (+# w_s8AD 1) of ww_s8AG { __DEFAULT ->
            +## (int2Double# w_s8AD) ww_s8AG
            };
          True -> +## (int2Double# w_s8AD) 0.5
    

    which is roughly equivalent to this function (modulo unboxed arithmetic)

    badSum3 :: Int -> Double
    badSum3 n = go 0
      where
        stop = 10 * n * n
    
        go i | i == stop = fromIntegral i + 0.5
             | otherwise = fromIntegral i + go (i + 1)
    

    Running this through Criterion, this function gives the same runtime as badSum2. While the generated function is not producing and consuming intermediate cons cells, it is still performing function calls and doing all the associated stack operations.

    The poor performance for for the foldl' based version is due to the fact that foldl' is not a good consumer, so it cannot fuse with the list comprehension. The left-fold will produce a tail recursive loop which walks the list produced by the list comprehension, incurring the overhead of all the allocation and associated memory operations.

    I am not sure if you can get the same performance as the Vector operations using the standard list operations, but the stream-fusion package provides list combinators using a different fusion method, which can achieve similar performance for this problem.

    import qualified Data.List.Stream as S
    
    -- The stream-fusion package does not seem to have `enumFrom` functions
    enumerate :: Int -> [Int]
    enumerate n = S.unfoldr f 0
      where
        f i | i > n     = Nothing
            | otherwise = Just (i, i + 1)
    
    goodSum2 :: Int -> Double
    goodSum2 n = S.foldl' (+) 0.5 $ S.map fromIntegral $ enumerate (n * n * 10)