haskell, haskell-pipes

Read large lines in huge file without buffering


I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine, and that blows through my memory; I later read that it eventually loads the whole file.

I also tried using pipes-text with folds and view Text.lines:

s <- Pipes.sum $ 
    folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s

to just count the number of lines, but it seems to be doing some wonky stuff ("hGetChunk: invalid argument (invalid byte sequence)"), and it takes 11 minutes whereas wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)

I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.

Thanks!


Solution

  • The following code uses Conduit, and will:

    • UTF8-decode standard input
    • Run the lineC combinator as long as there is more data available
    • For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
    • Sum up the yielded 1s and print the result

    You can replace the yield 1 code with something that processes the individual lines; a sketch of one such replacement follows the script below.

    #!/usr/bin/env stack
    -- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
    import Conduit
    
    main :: IO ()
    main = (runConduit
         $ stdinC                                  -- stream stdin as raw ByteString chunks
        .| decodeUtf8C                             -- UTF8-decode the chunks into Text
        .| peekForeverE (lineC (yield (1 :: Int))) -- per line: yield 1, discard the content
        .| sumC) >>= print                         -- sum the 1s and print the line count
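
    As an illustration of that replacement, here is a minimal sketch (not from the original answer) that swaps yield 1 for a per-line fold: foldlCE consumes each line's characters one at a time, so the pipeline prints a total character count while still never holding a full line in memory. The lineLen helper name is made up for this sketch, and it assumes lineC discards the newline itself.

    #!/usr/bin/env stack
    -- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
    import Conduit

    main :: IO ()
    main = (runConduit
         $ stdinC               -- stream stdin as raw ByteString chunks
        .| decodeUtf8C          -- UTF8-decode the chunks into Text
        .| peekForeverE lineLen -- one output per line, as long as input remains
        .| sumC) >>= print      -- total characters across all lines
      where
        -- Fold over a single line character by character without ever
        -- materializing the whole line, then yield the per-line count
        -- downstream. (lineLen is a hypothetical name for this sketch.)
        lineLen = lineC (foldlCE (\n _ -> n + 1) (0 :: Int)) >>= yield

    You would run it the same way (e.g. ./count.hs < hugefile), substituting your own per-line fold for the character count.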