Search code examples
haskellpointfree

Why does the pointfree version of my function use much more memory


I was working on a Project Euler problem and ended up with a Haskell file that included a function that looked like this:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr (\(cs', n) a -> fromBool (f cs cs') * n + a) 0

With fromBool imported from Foreign.Marshal.Utils just to quickly convert True to 1 and False to 0.

I was trying to get a little more speed out of my solution so I tried switching from foldr to foldl' (switching the arguments in the process) as I assumed foldr didn't make much sense to use on numbers.

Switching from foldr to foldl' caused me to allocate more than twice as much memory according to GHC's profiler.

For fun I also decided to replace the lambda with a pointfree version of the function:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr ((+) . uncurry ((*) . fromBool . f cs)) 0

This caused my memory allocation to increase 20x from the foldr version.

Now this isn't a huge deal as even in the 20x case the total memory allocation was only about 135Mb and the runtime of the program was relatively unaffected, if anything the higher memory allocation versions ran slightly faster.

But I am really curious as to how these results could be possible, so that in future I will be able to pick the "right" function when I don't have as much leeway.

EDIT:

GHC version 7.10.2, compiled with -O2 -prof -fprof-auto. Executed with +RTS -p.

EDIT 2:

Alright it looks like this is too difficult to reproduce to omit the rest of the code, well here is the entire program:

SPOILERS BELOW:

{-# LANGUAGE NoMonomorphismRestriction #-}

import Control.Monad
import Data.List
import Foreign.Marshal.Utils

data Color = Red | Green | Blue deriving (Eq, Enum, Bounded, Show)

colors :: [Color]
colors = [Red ..]

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f x = foldr ((+) . uncurry ((*) . fromBool . f x)) 0
-- matches f x = foldr (\(y, n) a -> fromBool (f x y) * n + a) 0
-- matches f x = foldl' (\a (y, n) -> fromBool (f x y) * n + a) 0

invert :: [([Color], Int)] -> [([Color], Int)]
invert rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM len colors
    valid (x : xs) (y : ys) = x /= y && valid xs ys
    valid _ _ = True

expand :: [([Color], Int)] -> [([Color], Int)]
expand rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM (len + 1) colors
    valid (x1 : x2 : xs) (y : ys) = x1 /= y && x2 /= y && valid (x2 : xs) ys
    valid _ _ = True

getRow :: Int -> [([Color], Int)]
getRow 1 = flip (,) 1 . pure <$> colors
getRow n = expand . invert $ getRow (n - 1)

result :: Int -> Int
result n = sum $ snd <$> getRow n

main :: IO ()
main = print $ result 8

Solution

  • Note: This post is written in literate Haskell. Copy it into a file, save it as *.lhs, and compile/load with in GHC(i). Also, I started writing this answer before you've edited your code, but the lesson stays the same.

    TL;DR

    The Prelude function uncurry is too lazy, whereas your pattern match is just strict enough.

    A word of caution and a disclaimer

    We're entering a magical, weird place. Beware. Also, my CORE abilities are rudimentary. Now that I've lost all my credibility, let's get started.

    The tested code

    In order to know where we get the additional memory requirements, it's useful to have more than two functions.

    > import Control.Monad (forM_)
    

    This is your original, non-pointfree variant:

    > matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
    > matches    f cs = foldr (\(cs', n) a -> fromEnum (f cs cs') * n + a) 0
    

    This is a variant that's only slightly point-free, the parameter a is eta-reduced.

    > matchesPF' :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
    > matchesPF' f cs = foldr (\(cs', n) -> (+) (fromEnum (f cs cs') * n)) 0
    

    This is a variant that inlines uncurry by hand.

    > matchesPFI :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
    > matchesPFI f cs = foldr ((+) . (\(cs', n) -> fromEnum (f cs cs') * n)) 0
    

    This your pointfree version.

    > matchesPF :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
    > matchesPF  f cs = foldr ((+) . uncurry  ((*) . fromEnum . f cs)) 0
    

    This is a variant that uses a custom uncurry, see below.

    > matchesPFU :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
    > matchesPFU f cs = foldr ((+) . uncurryI ((*) . fromEnum . f cs)) 0
    

    This is a variant that uses a custom lazy uncurry, see below.

    > matchesPFL :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
    > matchesPFL f cs = foldr ((+) . uncurryL ((*) . fromEnum . f cs)) 0
    

    To test the functions easily, we use a list:

    > funcs = [matches, matchesPF', matchesPF, matchesPFL, matchesPFU, matchesPFI]
    

    Our self-written uncurry:

    > uncurryI :: (a -> b -> c) -> (a, b) -> c
    > uncurryI f (a,b) = f a b
    

    A lazier uncurry:

    > uncurryL :: (a -> b -> c) -> (a, b) -> c
    > uncurryL f p = f (fst p) (snd p)
    

    The lazy variant uncurryL has the same semantics as the variant in Prelude, e.g.

    uncurry (\_ _ -> 0) undefined == 0 == uncurryL (\_ _ -> 0) undefined
    

    whereas uncurryI is strict in the pair's spine.

    > main = do
    >   let f a b = a < b
    >   forM_ [1..10] $ \i ->
    >     forM_ funcs $ \m ->
    >       print $ m f i (zip (cycle [1..10]) [1..i*100000])
    

    The list [1..i*100000] depends on i deliberately, so that we don't introduce a CAF and skew our allocation profile.

    The desugared code

    Before we delve into the profile, let's have a look at the desugared code of each function:

    ==================== Desugar (after optimization) ====================
    Result size of Desugar (after optimization)
      = {terms: 221, types: 419, coercions: 0}
    
    uncurryL
    uncurryL = \ @ a @ b @ c f p -> f (fst p) (snd p)
    
    uncurryI
    uncurryI = \ @ a @ b @ c f ds -> case ds of _ { (a, b) -> f a b }
    
    -- uncurried inlined by hand
    matchesPFI =
      \ @ a f cs ->
        foldr
          $fFoldable[]
          (. (+ $fNumInt)
             (\ ds ->
                case ds of _ { (cs', n) ->
                * $fNumInt (fromEnum $fEnumBool (f cs cs')) n
                }))
          (I# 0)
    
    -- lazy uncurry
    matchesPFL =
      \ @ a f cs ->
        foldr
          $fFoldable[]
          (. (+ $fNumInt)
             (uncurryL (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
          (I# 0)
    
    -- stricter uncurry
    matchesPFU =
      \ @ a f cs ->
        foldr
          $fFoldable[]
          (. (+ $fNumInt)
             (uncurryI (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
          (I# 0)
    
    -- normal uncurry
    matchesPF =
      \ @ a f cs ->
        foldr
          $fFoldable[]
          (. (+ $fNumInt)
             (uncurry (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
          (I# 0)
    
    -- eta-reduced a
    matchesPF' =
      \ @ a f cs ->
        foldr
          $fFoldable[]
          (\ ds ->
             case ds of _ { (cs', n) ->
             + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n)
             })
          (I# 0)
    
    -- non-point-free
    matches =
      \ @ a f cs ->
        foldr
          $fFoldable[]
          (\ ds a ->
             case ds of _ { (cs', n) ->
             + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n) a
             })
          (I# 0)
    

    So far, everything seems well. There's nothing surprising going on. Typeclass functions are replaced with their dictionary variants, e.g. foldr becomesfoldr $fFoldable[]`, since we call it on a list.

    The profile

       Mon Jul 18 15:47 2016 Time and Allocation Profiling Report  (Final)
    
           Prof +RTS -s -p -RTS
    
        total time  =        1.45 secs   (1446 ticks @ 1000 us, 1 processor)
        total alloc = 1,144,197,200 bytes  (excludes profiling overheads)
    
    COST CENTRE  MODULE    %time %alloc
    
    matchesPF'   Main       13.6    0.0
    matchesPF    Main       13.3   11.5
    main.\.\     Main       11.8   76.9
    main.f       Main       10.9    0.0
    uncurryL     Main        9.5   11.5
    matchesPFU   Main        8.9    0.0
    matchesPFI   Main        7.3    0.0
    matches      Main        6.9    0.0
    matchesPFL   Main        6.3    0.0
    uncurryI     Main        5.3    0.0
    matchesPF'.\ Main        2.6    0.0
    matchesPFI.\ Main        2.0    0.0
    matches.\    Main        1.5    0.0
    
    
                                                                 individual     inherited
    COST CENTRE        MODULE                  no.     entries  %time %alloc   %time %alloc
    
    MAIN               MAIN                     44           0    0.0    0.0   100.0  100.0
     main              Main                     89           0    0.0    0.0   100.0  100.0
      main.\           Main                     90          10    0.0    0.0   100.0  100.0
       main.\.\        Main                     92          60   11.8   76.9   100.0  100.0
        funcs          Main                     93           0    0.0    0.0    88.2   23.1
         matchesPFI    Main                    110          10    7.3    0.0    11.7    0.0
          matchesPFI.\ Main                    111     5500000    2.0    0.0     4.4    0.0
           main.f      Main                    112     5500000    2.4    0.0     2.4    0.0
         matchesPFU    Main                    107          10    8.9    0.0    15.3    0.0
          uncurryI     Main                    108     5500000    5.3    0.0     6.4    0.0
           main.f      Main                    109     5500000    1.1    0.0     1.1    0.0
         matchesPFL    Main                    104          10    6.3    0.0    17.7   11.5
          uncurryL     Main                    105     5500000    9.5   11.5    11.4   11.5
           main.f      Main                    106     5500000    1.9    0.0     1.9    0.0
         matchesPF     Main                    102          10   13.3   11.5    15.4   11.5
          main.f       Main                    103     5500000    2.1    0.0     2.1    0.0
         matchesPF'    Main                     99          10   13.6    0.0    17.2    0.0
          matchesPF'.\ Main                    100     5500000    2.6    0.0     3.6    0.0
           main.f      Main                    101     5500000    1.0    0.0     1.0    0.0
         matches       Main                     94          10    6.9    0.0    10.9    0.0
          matches.\    Main                     97     5500000    1.5    0.0     4.0    0.0
           main.f      Main                     98     5500000    2.5    0.0     2.5    0.0
     CAF               Main                     87           0    0.0    0.0     0.0    0.0
      funcs            Main                     91           1    0.0    0.0     0.0    0.0
      main             Main                     88           1    0.0    0.0     0.0    0.0
       main.\          Main                     95           0    0.0    0.0     0.0    0.0
        main.\.\       Main                     96           0    0.0    0.0     0.0    0.0
     CAF               GHC.IO.Handle.FD         84           0    0.0    0.0     0.0    0.0
     CAF               GHC.Conc.Signal          78           0    0.0    0.0     0.0    0.0
     CAF               GHC.IO.Encoding          76           0    0.0    0.0     0.0    0.0
     CAF               GHC.IO.Handle.Text       75           0    0.0    0.0     0.0    0.0
     CAF               GHC.IO.Encoding.Iconv    59           0    0.0    0.0     0.0    0.0
    

    Ignore the main\.\. noise, it's just the list. However, there's one point that one should notice immediately: matchesPF and uncurryL use the same alloc%:

    matchesPF    Main       13.3   11.5
    uncurryL     Main        9.5   11.5
    

    Getting to the CORE

    Now it's time to inspect the resulting CORE (ghc -ddump-simpl). We'll notice that most of the functions have been transformed into worker wrappers, and they look more or less the same (-dsuppress-all -dsuppress-uniques):

    $wa5
    $wa5 =
      \ @ a1 w w1 w2 ->
        letrec {
          $wgo
          $wgo =
            \ w3 ->
              case w3 of _ {
                [] -> 0;
                : y ys ->
                  case y of _ { (cs', n) ->
                  case $wgo ys of ww { __DEFAULT ->
                  case w w1 cs' of _ {
                    False -> case n of _ { I# y1 -> ww };
                    True -> case n of _ { I# y1 -> +# y1 ww }
                  }
                  }
                  }
              }; } in
        $wgo w2
    

    This is your usual worker-wrapper. $wgo takes a list, checks whether it's empty, is strict in the head (case y of _ { (cs', n) ->…) and lazy in the recursive result $wgo ys of ww.

    All functions look the same. Well, all except matchesPF (your variant)

    -- matchesPF
    $wa3 =
      \ @ a1 w w1 w2 ->
        letrec {
          $wgo =
            \ w3 ->
              case w3 of _ {
                [] -> 0;
                : y ys ->
                  case $wgo ys of ww { __DEFAULT ->
                  case let {
                         x = case y of _ { (x1, ds) -> x1 } } in
                       case w w1 x of _ {
                         False ->
                           case y of _ { (ds, y1) -> case y1 of _ { I# y2 -> main13 } };
                                  -- main13 is just #I 0
                         True -> case y of _ { (ds, y1) -> y1 }
                       }
                  of _ { I# x ->
                  +# x ww
                  }
                  }
              }; } in
        $wgo w2
    

    and matchesPFL (the variant that uses the lazy uncurryL)

    -- matchesPFL
    $wa2
    $wa2 =
      \ @ a1 w w1 w2 ->
        letrec {
          $wgo =
            \ w3 ->
              case w3 of _ {
                [] -> 0;
                : y ys ->
                  case $wgo ys of ww { __DEFAULT ->
                  case snd y of ww1 { I# ww2 ->
                  case let {
                         x = fst y } in
                       case w w1 x of _ {
                         False -> main13;
                         True -> ww1
                       }
                  of _ { I# x ->
                  +# x ww
                  }
                  }
                  }
              }; } in
        $wgo w2
    

    They are virtually the same. And both of them contain let bindings. This will create a thunk and usually lead to worse space requirements.

    The solution

    I think the culprit at this point is clear. It is uncurry. GHC wants to enforce the correct semantics of

    uncurry (const (const 0)) undefined
    

    However, this adds laziness and additional thunks. Your non-pointfree variant doesn't introduce that behaviour, since you pattern match on the pair:

    foldr (\(cs', n) a -> …)
    

    Still don't trust me? Use a lazy pattern match

    foldr (\ ~(cs', n) a -> …)
    

    and you will notice that matches will behave the same as matchesPF. So use slightly stricter variant of uncurry. uncurryI is enough to give the strictness analyzer a hint.

    Note that pairs are notorious for this behaviour. RWH spents a whole chapter trying to optimize the behaviour of a single function where intermediate pairs lead to problems.