Search code examples
haskelllazy-evaluationstrictness

Stack space overflow (possibly related to mapM)


I'm writing a program that creates a shell script containing one command for each image file in a directory. There are 667,944 images in the directory, so I need to handle the strictness/laziness issue properly.

Here's a simple example that gives me Stack space overflow. It does work if I give it more space using +RTS -Ksize -RTS, but it should be able run with little memory, producing output immediately. So I've been reading the stuff about strictness in the Haskell wiki and the wikibook on Haskell, trying to figure out how to fix the problem, and I think it's one of the mapM commands that is giving me grief, but I still don't understand enough about strictness to sort the problem.

I've found some other questions on SO that seem relevant (Is mapM in Haskell strict? Why does this program get a stack overflow? and Is Haskell's mapM not lazy?), but enlightenment still eludes me.

import System.Environment (getArgs)
import System.Directory (getDirectoryContents)

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++ 
    " -crop 143x143+140+140 " ++ outfile

main :: IO ()
main = do
  putStrLn "#!/bin/sh"
  (indir:outdir:_) <- getArgs
  files <- getDirectoryContents indir
  let imageFiles = filter (`notElem` [".", ".."]) files
  commands <- mapM (genCommand indir outdir) imageFiles
  mapM_ putStrLn commands

EDIT: TEST #1

Here's the newest version of the example.

import System.Environment (getArgs)
import System.Directory (getDirectoryContents)
import Control.Monad ((>=>))

genCommand :: FilePath -> FilePath -> FilePath -> IO String
genCommand indir outdir file = do
  let infile = indir ++ '/':file
  let angle = 0 -- have to actually read the file to calculate this for real
  let outfile = outdir ++ '/':file
  return $! "convert " ++ infile ++ " -rotate " ++ show angle ++ 
    " -crop 143x143+140+140 " ++ outfile

main :: IO ()
main = do
  putStrLn "TEST 1"
  (indir:outdir:_) <- getArgs
  files <- getDirectoryContents indir
  putStrLn $ show (length files)
  let imageFiles = filter (`notElem` [".", ".."]) files
  -- mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
  mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles

I compile it with the command ghc --make -O2 amy2.hs -rtsopts. If I run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat, I get

TEST 1
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

If I instead run it with the command ./amy2 ~/nosync/GalaxyZoo/table2/images/ wombat +RTS -K20M, I get the correct output...eventually:

TEST 1
667946
convert /home/amy/nosync/GalaxyZoo/table2/images//587736546846572812.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736546846572812.jpeg
convert /home/amy/nosync/GalaxyZoo/table2/images//587736542558617814.jpeg -rotate 0 -crop 143x143+140+140 wombat/587736542558617814.jpeg

...and so on.


Solution

  • This isn't really a strictness issue(*), but an order of evaluation issue. Unlike lazily evaluated pure values, monadic effects must happen in deterministic order. mapM executes every action in the given list and gathers the results, but it cannot return until the whole list of actions is executed, so you don't get the same streaming behavior as with pure list functions.

    The easy fix in this case is to run both genCommand and putStrLn inside the same mapM_. Note that mapM_ doesn't suffer from the same issue since it is not building an intermediate list.

    mapM_ (genCommand indir outdir >=> putStrLn) imageFiles
    

    The above uses the "kleisli composition operator" >=> from Control.Monad which is like the function composition operator . except for monadic functions. You can also use the normal bind and a lambda.

    mapM_ (\filename -> genCommand indir outdir filename >>= putStrLn) imageFiles
    

    For more complex I/O applications where you want better composability between small, monadic stream processors, you should use a library such as conduit or pipes.

    Also, make sure you are compiling with either -O or -O2.

    (*) To be exact, it is also a strictness issue, because in addition to building a large, intermediate list in memory, laziness causes mapM to build unnecessary thunks and use up stack.

    EDIT: So it seems the main culprit might be getDirectoryContents. Looking at the function's source code, it essentially does the same kind of list accumulation internally as mapM.

    In order to do streaming directory listing, we need to use System.Posix.Directory which unfortunately makes the program incompatible with non-POSIX systems (like Windows). You can stream the directory contents by e.g. using continuation passing style

    import System.Environment (getArgs)
    import Control.Monad ((>=>))
    
    import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
    import Control.Exception (bracket)
    
    genCommand :: FilePath -> FilePath -> FilePath -> IO String
    genCommand indir outdir file = do
      let infile = indir ++ '/':file
      let angle = 0 -- have to actually read the file to calculate this for real
      let outfile = outdir ++ '/':file
      return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
        " -crop 143x143+140+140 " ++ outfile
    
    streamingDirContents :: FilePath -> (FilePath -> IO ()) -> IO ()
    streamingDirContents root cont = do
        let loop stream = do
                fp <- readDirStream stream
                case fp of
                    [] -> return ()
                    _   | fp `notElem` [".", ".."] -> cont fp >> loop stream
                        | otherwise -> loop stream
        bracket (openDirStream root) loop closeDirStream
    
    
    main :: IO ()
    main = do
      putStrLn "TEST 1"
      (indir:outdir:_) <- getArgs
      streamingDirContents indir (genCommand indir outdir >=> putStrLn)
    

    Here's how you could do the same thing using conduit:

    import System.Environment (getArgs)
    
    import System.Posix.Directory (openDirStream, readDirStream, closeDirStream)
    
    import Data.Conduit
    import qualified  Data.Conduit.List as L
    import Control.Monad.IO.Class (liftIO, MonadIO)
    
    genCommand :: FilePath -> FilePath -> FilePath -> IO String
    genCommand indir outdir file = do
      let infile = indir ++ '/':file
      let angle = 0 -- have to actually read the file to calculate this for real
      let outfile = outdir ++ '/':file
      return $! "convert " ++ infile ++ " -rotate " ++ show angle ++
        " -crop 143x143+140+140 " ++ outfile
    
    dirSource :: (MonadResource m, MonadIO m) => FilePath -> Source m FilePath
    dirSource root = do
        bracketP (openDirStream root) closeDirStream $ \stream -> do
            let loop = do
                    fp <- liftIO $ readDirStream stream
                    case fp of
                        [] -> return ()
                        _  -> yield fp >> loop
            loop
    
    main :: IO ()
    main = do
        putStrLn "TEST 1"
        (indir:outdir:_) <- getArgs
        let files    = dirSource indir $= L.filter (`notElem` [".", ".."])
            commands = files $= L.mapM (liftIO . genCommand indir outdir)
    
        runResourceT $ commands $$ L.mapM_ (liftIO . putStrLn)
    

    The nice thing about conduit is that you regain the ability to compose pieces of functionality with things like conduit versions of filter and mapM. The $= operator streams stuff forward in the chain and $$ connects the stream to a consumer.

    The not-so-nice thing is that real world is complicated and writing efficient and robust code requires us to jump through some hoops with resource management. That's why all the operations work in the ResourceT monad transformer which keeps track of e.g. open file handles and cleans them up promptly and deterministically when they are no longer needed or e.g. if the computation gets aborted by an exception (this is in contrast to using lazy I/O and relying on the garbage collector to eventually release any scarce resources).

    However, this means that we a) need to run the final resulting conduit operation with runResourceT and b) we need to explicitly lift I/O operations to the transformed monad using liftIO instead of being able to directly write e.g. L.mapM_ putStrLn.