Consider this trivial algorithm of prime-decomposition of an integer n
: Let d'
be the divisor of n
last found. Initially, set d'=1
. Find the smallest divisor d>d'
of n
, and find the maximal value e
such that de
divides n
. Append de
to the answer and repeat the procedure on n/de
. Finally, stop when n
becomes 1. For simplicity, let's ignore mathematical optimizations, like stop at sqrt n
etc.
I have implemented it in two ways. The first one generates a list of division "attempts", and then groups the successful ones by divisors. For example, for n=20
, we first generate [(2,20),(2,10),(2,5),(3,5),(4,5),(5,5),(5,1)]
, which we then transform to the desired [(2,2),(5,1)]
using group
and other library functions.
The second implementation is an explicit recursion which keeps track of the exponent e
along the way, appends de
to the answer once the maximal e
is reached, proceeds to finding the "next" d
, and so on.
Question 1: Why does the first implementation run way slower than the second, despite the following:
div
, the core step of the algorithm, roughly the same number of times.divTrials n
, the list I am talking about, is transformed by a chain of higher order functions. In that, I think that the part map (\xs-> (head xs,length xs)) ... group
should tell the compiler that the list is just intermediate:{-# OPTIONS_GHC -O2 #-}
module GroupCheck where
import Data.List
import Data.Maybe
implement1 :: Integral t=> t -> [(t,Int)] -- IMPLEMENTATION 1
implement1 = map (\xs-> (head xs,length xs)).factorGroups where
tryDiv (d,n)
| n `mod` d == 0 = (d,n `div` d)
| n == 1 = (1,1) -- hack
| otherwise = (d+1,n)
divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
factorGroups = filter (not.null).map tail.group.map fst.divTrials
implement2 :: Show t => Integral t => t -> [(t,Int)] -- IMPLEMENTATION 2
implement2 num = keep2 $ tail $ go (1,0,1,num) where
range d n = [d+1..n]
nextd d n = fromMaybe n $ find ((0==).(n`mod`)) (range d n)
update (d,e,de,n)
| n `mod` d == 0 = update (d,e+1,de*d,n`div`d)
| otherwise = (d,e,de,n)
go (d,e,de,1) = [(d,e,de,1)]
go (d,e,de,n) = (d,e,de,n) : go (update (nextd d n,0,1,n))
keep2 = map (\(d,e,_,_)->(d,e))
main :: IO ()
main = do
let n = 293872
let ans1 = implement1 n
let ans2 = implement2 n
print ans1
print ans2
Profiling tells us that tryDiv
and divTrials
together eat up >99% of the entire execution time:
> stack ghc -- -main-is GroupCheck.main -prof -fprof-auto -rtsopts GroupCheck
> ./GroupCheck +RTS -p >/dev/null && cat GroupCheck.prof
GroupCheck +RTS -p -RTS
total time = 18.34 secs (18338 ticks @ 1000 us, 1 processor)
total alloc = 17,561,404,568 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
implement1.divTrials GroupCheck GroupCheck.hs:12:3-69 52.6 69.2
implement1.tryDiv GroupCheck GroupCheck.hs:(8,3)-(11,25) 47.2 30.8
Question 1.5: So.. what's so bad about these functions? Also,
Question 2: In a more general case of having to aggregate contiguous blocks of identical elements from a nondecreasing sequence, should we go the bulky implement2
way if we want speed? (Again, ignoring domain-specific optimizations.)
Or did I totally miss something obvious? Thanks!
Just to establish a baseline, I ran your program on a slightly larger starting number (so that time
didn't print out 0.00s). I chose n = 2938722345623
for no particular reason. Here's the timings before starting to tweak things:
ans1
: indistinguishable from infinity (I finished writing this entire answer and it was still running, about 26 minutes in total)
ans2
: 2.78s
The first thing to try is to tweak this line:
divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
This looks like a pretty natural definition, but it turns out that GHC never memoizes function calls. So if you want to make a list that's defined recursively in terms of itself, you must not make a function call in the recursion. Here's how:
divTrials n = xs where xs = takeWhile (/=(1,1)) $ (2,n): map tryDiv xs
Just that change brings the time down to 7.85s. Still off by a factor of about 3, but much much better.
The less obvious problem lies here:
factorGroups = filter (not.null).map tail.group.map fst.divTrials
Putting the group
so early breaks fusion, causing that intermediate list to actually be materialized. This means allocating and deallocating a lot of cons cells and tuples. Here's an implementation that has the same spirit, but puts more work before the group
:
tryDiv d n
| n `mod` d == 0 = d : tryDiv d (n `div` d)
| n == 1 = []
| otherwise = tryDiv (d+1) n
factorGroups = group . tryDiv 2
With that, we are down to 2.65s -- slightly faster than ans2
, though I only did one test of each so it's pretty likely to just be measurement noise.