Tags: haskell, optimization, allocation

How to optimize this Pythagorean triples implementation


Here's the Haskell code:

import GHC.Int

triples = [(x, y, z) | z <- [(1::Int32)..],
                       x <- [(1::Int32) .. z + 1],
                       y <- [x.. z + 1],
                       x * x + y * y == z * z]

main = mapM_ print (Prelude.take 1000 triples)

It has the following profile:

       triples +RTS -p -RTS

    total time  =       47.10 secs   (47103 ticks @ 1000 us, 1 processor)
    total alloc = 62,117,115,176 bytes  (excludes profiling overheads)

COST CENTRE MODULE    SRC                      %time %alloc

triples     Main      triples.hs:(5,1)-(8,46)  100.0  100.0

                                                                              individual      inherited
COST CENTRE  MODULE                SRC                     no.     entries  %time %alloc   %time %alloc

MAIN         MAIN                  <built-in>              118          0    0.0    0.0   100.0  100.0
 CAF         Main                  <entire-module>         235          0    0.0    0.0   100.0  100.0
  main       Main                  triples.hs:10:1-46      236          1    0.0    0.0     0.0    0.0
  triples    Main                  triples.hs:(5,1)-(8,46) 237          1  100.0  100.0   100.0  100.0
 CAF         GHC.Conc.Signal       <entire-module>         227          0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Encoding       <entire-module>         216          0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Encoding.Iconv <entire-module>         214          0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Handle.FD      <entire-module>         206          0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Handle.Text    <entire-module>         144          0    0.0    0.0     0.0    0.0
 main        Main                  triples.hs:10:1-46      238          0    0.0    0.0     0.0    0.0

Meanwhile, the equivalent Rust code runs an order of magnitude faster, which seems very strange to me:

fn triples() -> impl Iterator<Item=(i32, i32, i32)> {
    (1..).flat_map(|z| {
        (1..z + 1).flat_map(move |x| {
            (x..z + 1).filter_map(move |y| {
                if x * x + y * y == z * z {
                    Some((x, y, z))
                } else {
                    None
                }
            })
        })
    })
}

fn main() {
    for triple in triples().take(1000) {
        println!("{:?}", triple);
        // unsafe {printf("(%i, %i, %i)\n".as_ptr() as *const i8, x, y, z)};
    }
}

The results are:

[I] ~/c/pythagoras (master|✚1…) $ time ./range > /dev/null
0.16user 0.00system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 2248maxresident)k
0inputs+0outputs (0major+124minor)pagefaults 0swaps
[I] ~/c/pythagoras (master|✚1…) $ time ./triples > /dev/null
2.39user 0.00system 0:02.39elapsed 99%CPU (0avgtext+0avgdata 4736maxresident)k
0inputs+0outputs (0major+473minor)pagefaults 0swaps

Both results are with the -O3 flag.

Is it possible to optimize the allocations away while keeping the Haskell code idiomatic? Maybe some fusion library or something similar can do this?

EDIT 1. Okay, using Int instead of Int32 or Int64 makes the code faster, which is nice. Still, even with -fllvm it's twice as slow as Rust, and judging by the profile it still spends most of its time in allocations. What's preventing Haskell from reusing the triple, for example, instead of allocating a new one each time?


Solution

  • There are two issues with your code:

    1. For performance, you should compile without profiling and with optimization. Profiling adds significant overhead: on my system, ghc -prof results in a runtime of over 40 seconds, similar to your time, while ghc -O2 without -prof yields just 4.2 seconds.

    2. Using Int32 on a 64-bit system. You shouldn't do this, because non-native-sized Int operations are compiled to slow out-of-line primops. When I change Int32 to Int, the runtime becomes 0.44 seconds. If I additionally use -fllvm for the LLVM backend, I get 0.2 seconds. A sketch of the revised code is shown below.
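
Here is a minimal sketch of the change from point 2, with Int substituted for Int32 (a top-level type signature replaces the inline annotations; the rest is your original code):

-- Pythagorean triples, now using the native-sized Int.
triples :: [(Int, Int, Int)]
triples = [(x, y, z) | z <- [1 ..],
                       x <- [1 .. z + 1],
                       y <- [x .. z + 1],
                       x * x + y * y == z * z]

main :: IO ()
main = mapM_ print (take 1000 triples)

Compile it without profiling, e.g. ghc -O2 triples.hs, and optionally add -fllvm to use the LLVM backend.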