performance · haskell · pointfree

Point-free version worsens performance


It turns out I have this function defined in my program code:

st_zipOp :: (a -> a -> a) -> Stream a -> Stream a -> Stream a
st_zipOp f xs ys = St.foldr (\x r -> st_map (f x) r) xs ys

It does what it appears to do: it zips two values of type Stream a, which is a list-like type, using a binary operator on the element type a (applying that operator several times along the way, yes). The definition is pretty straightforward.

Once I had defined the function this way, I tried this other version:

st_zipOp :: (a -> a -> a) -> Stream a -> Stream a -> Stream a
st_zipOp = St.foldr . (st_map .)

As far as I know, this is exactly the same definition as above, just written point-free.
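
To spell out why, here is the chain of eta-reductions connecting the two definitions (a sketch, to be read top to bottom as successive rewrites rather than as compilable code; it only assumes that St.foldr and st_map have the usual foldr/map shapes):

st_zipOp f xs ys = St.foldr (\x r -> st_map (f x) r) xs ys
st_zipOp f xs    = St.foldr (\x r -> st_map (f x) r) xs   -- eta-reduce ys
st_zipOp f xs    = St.foldr (st_map . f) xs               -- \x r -> st_map (f x) r  is  st_map . f
st_zipOp f       = St.foldr (st_map . f)                  -- eta-reduce xs
st_zipOp f       = St.foldr ((st_map .) f)                -- st_map . f  is  (st_map .) f
st_zipOp         = St.foldr . (st_map .)                  -- eta-reduce f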

However, I wanted to check whether there was any performance change, and I found that the point-free version did indeed make the program run slightly worse (in both memory and time).

Why is this happening? If it is necessary, I can write a test program that reproduces this behavior.

I am compiling with -O2 if that makes a difference.

Simple test case

I wrote the following code to try to reproduce the behaviour explained above. I used lists this time, and the performance difference was less noticeable, but still present. This is the code:

opEvery :: (a -> a -> a) -> [a] -> [a] -> [a]
opEvery f xs ys = foldr (\x r -> map (f x) r) xs ys

opEvery' :: (a -> a -> a) -> [a] -> [a] -> [a]
opEvery' = foldr . (map .)

main :: IO ()
main = print $ sum $ opEvery (+) [1..n] [1..n]
 where
  n :: Integer
  n = 5000
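
For intuition, both versions give the same result on a small input; a quick GHCi check:

ghci> opEvery (+) [1,2,3] [10,20]
[31,32,33]
ghci> opEvery' (+) [1,2,3] [10,20]
[31,32,33]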

The profiling results using opEvery (the explicit-arguments version):

total time  =        2.91 secs   (2906 ticks @ 1000 us, 1 processor)
total alloc = 1,300,813,124 bytes  (excludes profiling overheads)

The profiling results using opEvery' (the point-free version):

total time  =        3.24 secs   (3242 ticks @ 1000 us, 1 processor)
total alloc = 1,300,933,160 bytes  (excludes profiling overheads)

However, I expected both versions to be equivalent (in all senses).


Solution

  • For the simple test case, both versions yield the same core when compiled with optimisations, but without profiling.

    When compiling with profiling enabled (-prof -fprof-auto), the pointful version gets inlined, with the main part of the resulting core being

    Rec {
    Main.main_go [Occ=LoopBreaker]
      :: [GHC.Integer.Type.Integer] -> [GHC.Integer.Type.Integer]
    [GblId, Arity=1, Str=DmdType S]
    Main.main_go =
      \ (ds_asR :: [GHC.Integer.Type.Integer]) ->
        case ds_asR of _ {
          [] -> xs_r1L8;
          : y_asW ys_asX ->
            let {
              r_aeN [Dmd=Just S] :: [GHC.Integer.Type.Integer]
              [LclId, Str=DmdType]
              r_aeN = Main.main_go ys_asX } in
            scctick<opEvery.\>
            GHC.Base.map
              @ GHC.Integer.Type.Integer
              @ GHC.Integer.Type.Integer
              (GHC.Integer.Type.plusInteger y_asW)
              r_aeN
        }
    end Rec }
    

    (you get something better without profiling).

    When compiling the point-free version with profiling enabled, opEvery' is not inlined, and you get

    Main.opEvery'
      :: forall a_aeW.
         (a_aeW -> a_aeW -> a_aeW) -> [a_aeW] -> [a_aeW] -> [a_aeW]
    [GblId,
     Str=DmdType,
     Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
             ConLike=False, WorkFree=False, Expandable=False,
             Guidance=IF_ARGS [] 80 60}]
    Main.opEvery' =
      \ (@ a_c) ->
        tick<opEvery'>
        \ (x_ass :: a_c -> a_c -> a_c) ->
          scc<opEvery'>
          GHC.Base.foldr
            @ a_c
            @ [a_c]
            (\ (x1_XsN :: a_c) -> GHC.Base.map @ a_c @ a_c (x_ass x1_XsN))
    
    Main.main4 :: [GHC.Integer.Type.Integer]
    [GblId,
     Str=DmdType,
     Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
             ConLike=False, WorkFree=False, Expandable=False,
             Guidance=IF_ARGS [] 40 0}]
    Main.main4 =
      scc<main>
      Main.opEvery'
        @ GHC.Integer.Type.Integer
        GHC.Integer.Type.plusInteger
        Main.main7
        Main.main5
    

    When you add an {-# INLINABLE opEvery' #-} pragma, it can be inlined even when compiling for profiling, giving

    Rec {
    Main.main_go [Occ=LoopBreaker]
      :: [GHC.Integer.Type.Integer] -> [GHC.Integer.Type.Integer]
    [GblId, Arity=1, Str=DmdType S]
    Main.main_go =
      \ (ds_asz :: [GHC.Integer.Type.Integer]) ->
        case ds_asz of _ {
          [] -> lvl_r1KU;
          : y_asE ys_asF ->
            GHC.Base.map
              @ GHC.Integer.Type.Integer
              @ GHC.Integer.Type.Integer
              (GHC.Integer.Type.plusInteger y_asE)
              (Main.main_go ys_asF)
        }
    end Rec }
    

    which is even a bit faster than the pragma-less pointful version, since it doesn't need to tick the counters.

    It is likely that a similar effect occurred for the Stream case.
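
    If the Stream version suffers from the same problem, the same kind of pragma might help there too; roughly (a sketch: the pragma simply accompanies the existing definition, wherever it lives, alongside Stream, St.foldr and st_map):

    -- INLINABLE keeps the unfolding available, so GHC can still inline the
    -- definition even when the module is compiled for profiling.
    {-# INLINABLE st_zipOp #-}
    st_zipOp :: (a -> a -> a) -> Stream a -> Stream a -> Stream a
    st_zipOp = St.foldr . (st_map .)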

    The takeaway:

    • Profiling inhibits optimisations. Code that is equivalent without profiling may no longer be equivalent when compiled for profiling.
    • Never measure performance using code that was compiled for profiling or without optimisations (see the sketch after this list).
    • Profiling can help you find out where the time is spent in your code, but it cannot tell you how fast your code is. (Occasionally, enabling profiling can entirely alter the behaviour of the code; anything relying heavily on rewrite-rule optimisations and/or inlining is a candidate for that to happen.)
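
    For example, a minimal criterion harness for the two list versions might look like this (a sketch; compile it with -O2 and without -prof):

    import Criterion.Main (bench, defaultMain, nf)

    opEvery :: (a -> a -> a) -> [a] -> [a] -> [a]
    opEvery f xs ys = foldr (\x r -> map (f x) r) xs ys

    opEvery' :: (a -> a -> a) -> [a] -> [a] -> [a]
    opEvery' = foldr . (map .)

    main :: IO ()
    main = defaultMain
      -- nf forces each Integer result to normal form, so the full
      -- computation is measured on every iteration.
      [ bench "explicit"   (nf (\n -> sum (opEvery  (+) [1..n] [1..n])) limit)
      , bench "point-free" (nf (\n -> sum (opEvery' (+) [1..n] [1..n])) limit)
      ]
      where
        limit :: Integer
        limit = 5000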