Search code examples
haskellbytestringlong-lineslazy-io

Haskell lazy Bytestring words not lazy?


I have the following Haskell program for computing a maximum sum substring of a string of integers:

{-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -O2 #-}
import Data.Functor
import Data.Maybe 
import Data.ByteString.Lazy.Char8 (getContents,lines,readInt,words)
import Prelude hiding (getContents,words,lines)

main = do
    cont <- words <$> getContents
    putStrLn $ show $ snd $ foldl opt (0,0) $ map (fst.fromJust.readInt) cont

opt (!c,!m) x = (max 0 (c+x),max m (c+x))

The problem with this program is that it reads the whole file into memory. The corresponding program without BytesString does not have this problem:

{-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -O2 #-}
import Data.Functor
import Data.Maybe 

main = do
    cont <- words <$> getContents
    putStrLn $ show $ snd $ foldl opt (0,0) $ map read cont
opt (!c,!m) x = (max 0 (c+x),max m (c+x))

It only uses a small constant amount of memory, but of course it is excruciatingly slow (about 25x slower).

The problem only occurs for programs that read very large lines. If the input is spread over multiple small lines, ByteString performs as expected.

Is there any way around this?


Solution

  • The use of lazy tuples there is sub-optimal. This is better rewritten as:

    main = do
        cont <- words <$> getContents
        putStrLn $ show $ sndT $ foldl opt (T 0 0) $ map (fst.fromJust.readInt) cont
    
    sndT :: T -> Int
    sndT (T _ m) = m
    
    opt (T c m) x = T (max 0 (c+x)) (max m (c+x))
    
    data T = T {-# UNPACK #-} !Int  {-# UNPACK #-}!Int
    

    So you get a strict, unboxed accumulator. However, you're better off writing this whole thing as an incremental left fold. that's why readInt returns the remaining input in its 2nd parameter. No need for the sum . map . words pipeline.


    The version you submitted leaks space. Run on a large file, and it uses heap proportional to the file size (on 640k entries).

    $ time ./A +RTS -p -s -K50M < input.txt.2
    346882
         326,337,136 bytes allocated in the heap
         302,321,732 bytes copied during GC
          82,617,772 bytes maximum residency (8 sample(s))
           1,466,500 bytes maximum slop
                 149 MB total memory in use (0 MB lost due to fragmentation)
    
      %GC     time      63.8%  (63.9% elapsed)
    

    So it is retaining the file, as you say.

    enter image description here

    So what is retaining memory? One clue is the foldl with opt. You pass it a lazy tuple. And foldl is lazy in its accumulator.

    So you're simply building up a long chain of unevaluated + operations. The bang patterns on opt make no difference, since foldl never evaluates its accumulator. Only when you finally inspect the result at the end does the whole thing collapse.

    This is a classic space leak. So:

    • Use foldl' -- it is strict in the accumulator
    • Avoid intermediate lists entirely (use readInt + unfoldr).
    • If you must traverse a list, use a strict tuple on the accumulator for even better performance.