Tags: performance, haskell, functional-programming, bytestring

Increasing performance in file manipulation


I have a file that contains a matrix of numbers like the following:

0 10 24 10 13 4 101 ...
6 0 52 10 4 5 0 4 ...
3 4 0 86 29 20 77 294 ...
4 1 1 0 78 100 83 199 ...
5 4 9 10 0 58 8 19 ...
6 58 60 13 68 0 148 41 ...
. .
.   .
.     .

What I am trying to do is sum each row and write the sums to a new file, one sum per line.

I have tried doing it in Haskell using ByteStrings, but it is about three times as slow as the Python implementation. Here is the Haskell implementation:

import qualified Data.ByteString.Char8 as B

-- This function is for summing a row
sumrows r = foldr (\x y -> (maybe 0 (*1) $ fst <$> (B.readInt x)) + y) 0 (B.split ' ' r)

-- This function is for mapping the sumrows function to each line
sumfile f = map (\x -> (show x) ++ "\n") (map sumrows (B.split '\n' f)) 

main = do
  contents <- B.readFile "telematrix"
  -- I get the sum of each line, and then pack up all the results so that it can be written
  B.writeFile "teleDensity" $ (B.pack . concat) (sumfile contents)
  print "complete"

This takes about 14 seconds for a 25 MB file.

Here is the Python implementation:

fd = open("telematrix", "r")
nfd = open("teleDensity", "w")

for line in fd: 
  nfd.write(str(sum(map(int, line.split(" ")))) + "\n")

fd.close()
nfd.close()

This takes about 5 seconds for the same 25 MB file.

Any suggestions on how to improve the performance of the Haskell implementation?


Solution

  • The main reason for the poor performance was that I was running the program with runhaskell, which interprets it, instead of compiling it first and running the resulting executable. So I switched from:

    runhaskell program.hs
    

    to

    ghc program.hs
    
    ./program
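
Once the program is compiled, the code itself may also leave some room for improvement. The following is only a rough sketch, not the code from the question: the names sumRow and sumFile are made up for illustration, and reading from standard input via B.interact is an assumption that replaces the hard-coded file names. It keeps the sums as ByteStrings rather than building Strings, and uses a strict left fold instead of foldr.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (foldl')

-- Hypothetical helper: sum the whitespace-separated integers on one line.
-- Tokens that do not parse as an Int count as 0, as in the original code.
sumRow :: B.ByteString -> Int
sumRow = foldl' (\acc w -> acc + maybe 0 fst (B.readInt w)) 0 . B.words

-- Hypothetical helper: one sum per input line, each on its own output line.
sumFile :: B.ByteString -> B.ByteString
sumFile = B.unlines . map (B.pack . show . sumRow) . B.lines

-- Assumption: input arrives on stdin and the sums go to stdout,
-- instead of the fixed "telematrix" and "teleDensity" file names.
main :: IO ()
main = B.interact sumFile
```

Compiled with optimisation (ghc -O2 program.hs), this would be run as ./program < telematrix > teleDensity. Whether it is noticeably faster than the compiled original would have to be measured on the actual file.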