performance · haskell · pointfree

Point-free version worsens performance


It turns out I have this function defined in my program code:

st_zipOp :: (a -> a -> a) -> Stream a -> Stream a -> Stream a
st_zipOp f xs ys = St.foldr (\x r -> st_map (f x) r) xs ys

It does what it appears to do: it zips two values of type Stream a, which is a list-like type, using a binary operator on the element type a (applying that operator several times along the way, yes). The definition is pretty straightforward.

Once I had defined the function this way, I tried this other version:

st_zipOp :: (a -> a -> a) -> Stream a -> Stream a -> Stream a
st_zipOp = St.foldr . (st_map .)

As far as I know, this is exactly the same definition as above, just written point-free.
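
To spell out why, here is the chain of eta-reductions connecting the two definitions (a sketch, to be read top to bottom as successive rewrites rather than as compilable code; it only assumes that St.foldr and st_map have the usual foldr/map shapes):

st_zipOp f xs ys = St.foldr (\x r -> st_map (f x) r) xs ys
st_zipOp f xs    = St.foldr (\x r -> st_map (f x) r) xs   -- eta-reduce ys
st_zipOp f xs    = St.foldr (st_map . f) xs               -- \x r -> st_map (f x) r  is  st_map . f
st_zipOp f       = St.foldr (st_map . f)                  -- eta-reduce xs
st_zipOp f       = St.foldr ((st_map .) f)                -- st_map . f  is  (st_map .) f
st_zipOp         = St.foldr . (st_map .)                  -- eta-reduce f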

However, I wanted to check whether there was any performance change, and I found that the point-free version did indeed make the program run slightly worse (in both memory and time).

Why is this happening? If it is necessary, I can write a test program that reproduces this behavior.

I am compiling with -O2 if that makes a difference.

Simple test case

I wrote the following code to try to reproduce the behaviour explained above. I used lists this time, and the performance difference was less noticeable, but still present. This is the code:

opEvery :: (a -> a -> a) -> [a] -> [a] -> [a]
opEvery f xs ys = foldr (\x r -> map (f x) r) xs ys

opEvery' :: (a -> a -> a) -> [a] -> [a] -> [a]
opEvery' = foldr . (map .)

main :: IO ()
main = print $ sum $ opEvery (+) [1..n] [1..n]
 where
  n :: Integer
  n = 5000
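
For intuition, both versions give the same result on a small input; a quick GHCi check:

ghci> opEvery (+) [1,2,3] [10,20]
[31,32,33]
ghci> opEvery' (+) [1,2,3] [10,20]
[31,32,33]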

The profiling results using opEvery (the explicit-arguments version):

total time  =        2.91 secs   (2906 ticks @ 1000 us, 1 processor)
total alloc = 1,300,813,124 bytes  (excludes profiling overheads)

The profiling results using opEvery' (the point-free version):

total time  =        3.24 secs   (3242 ticks @ 1000 us, 1 processor)
total alloc = 1,300,933,160 bytes  (excludes profiling overheads)

However, I expected both versions to be equivalent (in all senses).


Solution

  • For the simple test case, both versions yield the same core when compiled with optimisations, but without profiling.

    When compiling with profiling enabled (-prof -fprof-auto), the pointful version gets inlined, with the main part of the resulting core being

    Rec {
    Main.main_go [Occ=LoopBreaker]
      :: [GHC.Integer.Type.Integer] -> [GHC.Integer.Type.Integer]
    [GblId, Arity=1, Str=DmdType S]
    Main.main_go =
      \ (ds_asR :: [GHC.Integer.Type.Integer]) ->
        case ds_asR of _ {
          [] -> xs_r1L8;
          : y_asW ys_asX ->
            let {
              r_aeN [Dmd=Just S] :: [GHC.Integer.Type.Integer]
              [LclId, Str=DmdType]
              r_aeN = Main.main_go ys_asX } in
            scctick<opEvery.\>
            GHC.Base.map
              @ GHC.Integer.Type.Integer
              @ GHC.Integer.Type.Integer
              (GHC.Integer.Type.plusInteger y_asW)
              r_aeN
        }
    end Rec }
    

    (you get something better without profiling).

    When compiling the point-free version with profiling enabled, opEvery' is not inlined, and you get

    Main.opEvery'
      :: forall a_aeW.
         (a_aeW -> a_aeW -> a_aeW) -> [a_aeW] -> [a_aeW] -> [a_aeW]
    [GblId,
     Str=DmdType,
     Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
             ConLike=False, WorkFree=False, Expandable=False,
             Guidance=IF_ARGS [] 80 60}]
    Main.opEvery' =
      \ (@ a_c) ->
        tick<opEvery'>
        \ (x_ass :: a_c -> a_c -> a_c) ->
          scc<opEvery'>
          GHC.Base.foldr
            @ a_c
            @ [a_c]
            (\ (x1_XsN :: a_c) -> GHC.Base.map @ a_c @ a_c (x_ass x1_XsN))
    
    Main.main4 :: [GHC.Integer.Type.Integer]
    [GblId,
     Str=DmdType,
     Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
             ConLike=False, WorkFree=False, Expandable=False,
             Guidance=IF_ARGS [] 40 0}]
    Main.main4 =
      scc<main>
      Main.opEvery'
        @ GHC.Integer.Type.Integer
        GHC.Integer.Type.plusInteger
        Main.main7
        Main.main5
    

    When you add an {-# INLINABLE opEvery' #-} pragma, it can be inlined even when compiling for profiling, giving

    Rec {
    Main.main_go [Occ=LoopBreaker]
      :: [GHC.Integer.Type.Integer] -> [GHC.Integer.Type.Integer]
    [GblId, Arity=1, Str=DmdType S]
    Main.main_go =
      \ (ds_asz :: [GHC.Integer.Type.Integer]) ->
        case ds_asz of _ {
          [] -> lvl_r1KU;
          : y_asE ys_asF ->
            GHC.Base.map
              @ GHC.Integer.Type.Integer
              @ GHC.Integer.Type.Integer
              (GHC.Integer.Type.plusInteger y_asE)
              (Main.main_go ys_asF)
        }
    end Rec }
    

    which is even a bit faster than the pragma-less pointful version, since it doesn't need to tick the counters.

    It is likely that a similar effect occurred for the Stream case.
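
    If the Stream version suffers from the same problem, the same kind of pragma might help there too; roughly (a sketch: the pragma simply accompanies the existing definition, wherever it lives, alongside Stream, St.foldr and st_map):

    -- INLINABLE keeps the unfolding available, so GHC can still inline the
    -- definition even when the module is compiled for profiling.
    {-# INLINABLE st_zipOp #-}
    st_zipOp :: (a -> a -> a) -> Stream a -> Stream a -> Stream a
    st_zipOp = St.foldr . (st_map .)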

    The takeaway:

    • Profiling inhibits optimisations. Code that is equivalent without profiling may no longer be equivalent when compiled for profiling.
    • Never measure performance using code that was compiled for profiling or without optimisations (see the sketch after this list).
    • Profiling can help you find out where the time is spent in your code, but it cannot tell you how fast your code is. (Occasionally, enabling profiling can entirely alter the behaviour of the code; anything relying heavily on rewrite-rule optimisations and/or inlining is a candidate for that to happen.)
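
    For example, a minimal criterion harness for the two list versions might look like this (a sketch; compile it with -O2 and without -prof):

    import Criterion.Main (bench, defaultMain, nf)

    opEvery :: (a -> a -> a) -> [a] -> [a] -> [a]
    opEvery f xs ys = foldr (\x r -> map (f x) r) xs ys

    opEvery' :: (a -> a -> a) -> [a] -> [a] -> [a]
    opEvery' = foldr . (map .)

    main :: IO ()
    main = defaultMain
      -- nf forces each Integer result to normal form, so the full
      -- computation is measured on every iteration.
      [ bench "explicit"   (nf (\n -> sum (opEvery  (+) [1..n] [1..n])) limit)
      , bench "point-free" (nf (\n -> sum (opEvery' (+) [1..n] [1..n])) limit)
      ]
      where
        limit :: Integer
        limit = 5000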