performance haskell grouping lazy-evaluation prime-factoring

Haskell: map length . group is way slower than explicit recursion?

Consider this trivial algorithm of prime-decomposition of an integer n: Let d' be the divisor of n last found. Initially, set d'=1. Find the smallest divisor d>d' of n, and find the maximal value e such that d^e divides n. Append d^e to the answer and repeat the procedure on n/d^e. Finally, stop when n becomes 1. For simplicity, let's ignore mathematical optimizations, like stop at sqrt n etc.

I have implemented it in two ways. The first one generates a list of division "attempts", and then groups the successful ones by divisors. For example, for n=20, we first generate [(2,20),(2,10),(2,5),(3,5),(4,5),(5,5),(5,1)], which we then transform to the desired [(2,2),(5,1)] using group and other library functions.

The second implementation is an explicit recursion which keeps track of the exponent e along the way, appends d^e to the answer once the maximal e is reached, proceeds to finding the "next" d, and so on.

Question 1: Why does the first implementation run way slower than the second, despite the following:

Both the implementations execute div, the core step of the algorithm, roughly the same number of times.
Lazy evaluation (and fusion?) has the effect that the long list illustrated above never has to be materialized in the first place. As you can see in the code below, divTrials n, the list I am talking about, is transformed by a chain of higher order functions. In that, I think that the part map (\xs-> (head xs,length xs)) ... group should tell the compiler that the list is just intermediate:

{-# OPTIONS_GHC -O2 #-}
module GroupCheck where
import Data.List
import Data.Maybe

implement1 :: Integral t=> t -> [(t,Int)]                    -- IMPLEMENTATION 1
implement1  = map (\xs-> (head xs,length xs)).factorGroups where
  tryDiv (d,n)
    | n `mod` d == 0 = (d,n `div` d)
    | n == 1 = (1,1) -- hack
    | otherwise = (d+1,n)
  divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
  factorGroups = filter (not.null).map tail.group.map fst.divTrials

implement2 :: Show t => Integral t => t -> [(t,Int)]         -- IMPLEMENTATION 2
implement2 num = keep2 $ tail $ go (1,0,1,num) where
  range d n = [d+1..n]
  nextd d n = fromMaybe n $ find ((0==).(n`mod`)) (range d n)
  update (d,e,de,n)
    | n `mod` d == 0 = update (d,e+1,de*d,n`div`d)
    | otherwise      = (d,e,de,n)
  go (d,e,de,1) = [(d,e,de,1)]
  go (d,e,de,n) = (d,e,de,n) : go (update (nextd d n,0,1,n))
  keep2 = map (\(d,e,_,_)->(d,e))

main :: IO ()
main = do
  let n = 293872
  let ans1 = implement1 n 
  let ans2 = implement2 n
  print ans1
  print ans2

Profiling tells us that tryDiv and divTrials together eat up >99% of the entire execution time:

> stack ghc -- -main-is GroupCheck.main -prof -fprof-auto -rtsopts GroupCheck 
> ./GroupCheck +RTS -p >/dev/null && cat GroupCheck.prof


           GroupCheck +RTS -p -RTS

        total time  =       18.34 secs   (18338 ticks @ 1000 us, 1 processor)
        total alloc = 17,561,404,568 bytes  (excludes profiling overheads)

COST CENTRE          MODULE     SRC                          %time %alloc

implement1.divTrials GroupCheck GroupCheck.hs:12:3-69         52.6   69.2
implement1.tryDiv    GroupCheck GroupCheck.hs:(8,3)-(11,25)   47.2   30.8

Question 1.5: So.. what's so bad about these functions? Also,

Question 2: In a more general case of having to aggregate contiguous blocks of identical elements from a nondecreasing sequence, should we go the bulky implement2 way if we want speed? (Again, ignoring domain-specific optimizations.)

Or did I totally miss something obvious? Thanks!

Solution

Just to establish a baseline, I ran your program on a slightly larger starting number (so that time didn't print out 0.00s). I chose n = 2938722345623 for no particular reason. Here's the timings before starting to tweak things:

ans1: indistinguishable from infinity (I finished writing this entire answer and it was still running, about 26 minutes in total)
ans2: 2.78s

The first thing to try is to tweak this line:

divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)

This looks like a pretty natural definition, but it turns out that GHC never memoizes function calls. So if you want to make a list that's defined recursively in terms of itself, you must not make a function call in the recursion. Here's how:

divTrials n = xs where xs = takeWhile (/=(1,1)) $ (2,n): map tryDiv xs

Just that change brings the time down to 7.85s. Still off by a factor of about 3, but much much better.

The less obvious problem lies here:

factorGroups = filter (not.null).map tail.group.map fst.divTrials

Putting the group so early breaks fusion, causing that intermediate list to actually be materialized. This means allocating and deallocating a lot of cons cells and tuples. Here's an implementation that has the same spirit, but puts more work before the group:

  tryDiv d n
    | n `mod` d == 0 = d : tryDiv d (n `div` d)
    | n == 1 = []
    | otherwise = tryDiv (d+1) n
  factorGroups = group . tryDiv 2

With that, we are down to 2.65s -- slightly faster than ans2, though I only did one test of each so it's pretty likely to just be measurement noise.