I'm trying to work over big files using Haskell. I'd like to browse an input file byte after byte, and to generate an output byte after byte. Of course I need the IO to be buffered with blocks of reasonable size (a few KB). I can't do it, and I need your help please.
import System
import qualified Data.ByteString.Lazy as BL
import Data.Word
import Data.List
main :: IO ()
main =
do
args <- System.getArgs
let filename = head args
byteString <- BL.readFile filename
let wordsList = BL.unpack byteString
let foldFun acc word = doSomeStuff word : acc
let wordsListCopy = foldl' foldFun [] wordsList
let byteStringCopy = BL.pack (reverse wordsListCopy)
BL.writeFile (filename ++ ".cpy") byteStringCopy
where
doSomeStuff = id
I name this file TestCopy.hs
, then do the following:
$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root 5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs
[1 of 1] Compiling Main ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB
real 0m5.631s
user 0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB
real 3m6.671s
user 0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy
$ time ./TestCopy 10MB +RTS -K500M -RTS
real 2m50.261s
user 0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy
$
My problem: There is a huge difference between a 5MB and a 10 MB file. I'd like the performances to be linear in the size of the input file. Please what am i doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard ghc library.
Precision: It's for a university project. And I'm not trying to copy files. The doSomeStuff
function shall perform compression/decompression actions that I have to customize.
I take it this is a follow on to Reading large file in haskell? from yesterday.
Try compiling with "-rtsopts -O2" instead of just "-O".
You claim "I'd like to browse an input file byte after byte, and to generate an output byte after byte." but your code reads the entire input before trying to create any output. This is just not very representative of the goal.
With my system I see "ghc -rtsopts --make -O2 b.hs" giving
(! 741)-> time ./b five
real 0m2.789s user 0m2.235s sys 0m0.518s
(! 742)-> time ./b ten
real 0m5.830s user 0m4.623s sys 0m1.027s
Which now looks linear to me.