Search code examples
iolazy-evaluationbytestringhaskell

IO over big files in haskell: Performance issue


I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.

import System 
import qualified Data.ByteString.Lazy as BL 
import Data.Word  
import Data.List

main :: IO () 
main =     
    do         
        args <- System.getArgs         
        let filename = head args         
        byteString <- BL.readFile filename         
        let wordsList = BL.unpack byteString         
        let foldFun acc word = doSomeStuff word : acc
        let wordsListCopy = foldl' foldFun [] wordsList
        let byteStringCopy = BL.pack (reverse wordsListCopy)
        BL.writeFile (filename ++ ".cpy") byteStringCopy
    where
        doSomeStuff = id

I name this file TestCopy.hs, then do the following:

$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root  5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs 
[1 of 1] Compiling Main             ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB

real    0m5.631s
user    0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB 

real    3m6.671s
user    0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy 
$ time ./TestCopy 10MB +RTS -K500M -RTS

real    2m50.261s
user    0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy 
$ 

My problem: There is a huge difference between a 5MB and a 10 MB file. I'd like the performances to be linear in the size of the input file. Please what am i doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard ghc library.

Precision: It's for a university project. And I'm not trying to copy files. The doSomeStuff function shall perform compression/decompression actions that I have to customize.


Solution

  • I take it this is a follow on to Reading large file in haskell? from yesterday.

    Try compiling with "-rtsopts -O2" instead of just "-O".

    You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.

    With my system I see "ghc -rtsopts --make -O2 b.hs" giving

    (! 741)-> time ./b five
    real 0m2.789s user 0m2.235s sys 0m0.518s
    (! 742)-> time ./b ten
    real 0m5.830s user 0m4.623s sys 0m1.027s
    

    Which now looks linear to me.