I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine, and that blows through my memory. I read later that it eventually loads the whole file.
I also tried using pipes-text with folds and view lines:
s <- Pipes.sum $
  folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s
to just count the number of lines, but it behaves oddly: it reports "hGetChunk: invalid argument (invalid byte sequence)" and it takes 11 minutes where wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1GB.)
I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.
Thanks!
The following code uses Conduit, and will:
- read standard input and decode it as UTF-8
- run the lineC combinator as long as there is more data available
- for each line, yield the value 1 and discard the line content, without ever reading the entire line into memory at once
- sum up the 1s yielded and print the result

You can replace the yield 1 code with something that does processing on the individual lines.
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC (yield (1 :: Int)))
    .| sumC) >>= print
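
As a rough sketch of swapping yield 1 for real per-line processing, the variant below prints the length in characters of every line instead of counting lines. The choice of "line length" is only an illustration, not part of the original answer, and it assumes that lengthCE and mapM_C from the same Conduit module behave as documented. Since lengthCE consumes the chunks of a line as they arrive, a very long line should not need to be held in memory all at once.

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = runConduit
     $ stdinC
    .| decodeUtf8C
    -- Illustrative replacement for `yield 1`: lineC feeds the chunks of one
    -- line to lengthCE, which counts characters chunk by chunk; the resulting
    -- count is then yielded downstream as a single value per line.
    .| peekForeverE (lineC lengthCE >>= yield)
    -- Print each line length as soon as that line has been consumed.
    .| mapM_C (print :: Int -> IO ())
```

Like the line-counting version, it reads from standard input, so it would be run with the file redirected into the script. Any other consumer of the line's chunks can take the place of lengthCE, as long as it works on the chunks incrementally rather than collecting them.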