Here's haskell code
import GHC.Int
triples = [(x, y, z) | z <- [(1::Int32)..],
x <- [(1::Int32) .. z + 1],
y <- [x.. z + 1],
x * x + y * y == z * z]
main = mapM_ print (Prelude.take 1000 triples)
Which has following profile
triples +RTS -p -RTS
total time = 47.10 secs (47103 ticks @ 1000 us, 1 processor)
total alloc = 62,117,115,176 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
triples Main triples.hs:(5,1)-(8,46) 100.0 100.0
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 118 0 0.0 0.0 100.0 100.0
CAF Main <entire-module> 235 0 0.0 0.0 100.0 100.0
main Main triples.hs:10:1-46 236 1 0.0 0.0 0.0 0.0
triples Main triples.hs:(5,1)-(8,46) 237 1 100.0 100.0 100.0 100.0
CAF GHC.Conc.Signal <entire-module> 227 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 216 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 214 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 206 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.Text <entire-module> 144 0 0.0 0.0 0.0 0.0
main Main triples.hs:10:1-46 238 0 0.0 0.0 0.0 0.0
While equivalent rust
code runs an order of magnitude faster. Which seems very strange to me.
fn triples() -> impl Iterator<Item=(i32, i32, i32)> {
(1..).flat_map(|z| {
(1..z + 1).flat_map(move |x| {
(x..z + 1).filter_map(move |y| {
if x * x + y * y == z * z {
Some((x, y, z))
} else {
None
}
})
})
})
}
fn main() {
for triple in triples().take(1000) {
println!("{:?}", triple);
// unsafe {printf("(%i, %i, %i)\n".as_ptr() as *const i8, x, y, z)};
}
}
Results are
[I] ~/c/pythagoras (master|✚1…) $ time ./range > /dev/null
0.16user 0.00system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 2248maxresident)k
0inputs+0outputs (0major+124minor)pagefaults 0swaps
[I] ~/c/pythagoras (master|✚1…) $ time ./triples > /dev/null
2.39user 0.00system 0:02.39elapsed 99%CPU (0avgtext+0avgdata 4736maxresident)k
0inputs+0outputs (0major+473minor)pagefaults 0swaps
Both results are with -O3
flag.
Is it possible to optimize allocations out while saving idiomatic haskell code? Maybe some fusion library or something can do this?
EDIT1. Okay using Int
instead of Int32
or Int64
makes the code faster which is nice. Still with fflvm
it's twice slower than rust and judging by profile it's still spends most of the time in allocations. What's preventing haskell from reusing the triple for example and not allocating it only once?
There are two issues with your code:
For performance, you should compile without profiling, and with optimization. Profiling adds significant overhead. On my system ghc -prof
results in a runtime of over 40 seconds, similarly to your time. ghc -O2
without -prof
yields just 4.2 seconds.
Using Int32
on a 64 bit system. You shouldn't do this, because non-native sized Int
operations are compiled to slow out-of-line primops. When I change Int32
to Int
, runtime becomes 0.44 seconds. If I additionally use -fllvm
for the LLVM code backend, I get 0.2 seconds.