Search code examples
haskelliolazy-evaluationbytestringwritefile

Why does this not run in constant memory?


I am trying to write a very large amount of data to a file in constant memory.

import qualified Data.ByteString.Lazy as B

{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    createDirectoryIfMissing True "data/grids/"
    B.writeFile (gridFileName num aa)
                (encode (take num grids))
    B.writeFile (shuffledFileName num aa)
                (encode (take num shuffleds))

However this consumes memory proportional to the size of num. I know createGrids is a sufficiently lazy function because I have tested it by appending error "not lazy enough" (as suggested by the Haskell wiki here) to the end of the lists it returns and no errors are raised. take is a lazy function that is defined in Data.List. encode is also a lazy function defined in Data.Binary. B.writeFile is defined in Data.ByteString.Lazy.

Here is the complete code so you can execute it:

import Control.Arrow (first)
import Data.Binary
import GHC.Float (double2Float)
import System.Random (next)
import System.Random.Mersenne.Pure64 (PureMT, newPureMT, randomDouble)
import System.Random.Shuffle (shuffle')
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = writeGrids 1000 64

{- Creates and writes num grids of dimensions aa x aa -}
writeGrids :: Int -> Int -> IO ()
writeGrids num aa = do
    rng <- newPureMT
    let (grids,shuffleds) = createGrids rng aa
    B.writeFile "grids.bin" (encode (take num grids))
    B.writeFile "shuffleds.bin" (encode (take num shuffleds))

{- a random number generator, dimension of grids to make
   returns a pair of lists, the first is a list of grids of dimensions
   aa x aa, the second is a list of the shuffled grids corresponding to the first list -}
createGrids :: PureMT -> Int -> ([[(Float,Float)]],[[(Float,Float)]])
createGrids rng aa = (grids,shuffleds) where
       rs = randomFloats rng
       grids = map (getGridR aa) (chunksOf (2 * aa * aa) rs) 
       shuffleds = shuffler (aa * aa) rng grids

{- length of each grid, a random number generator, a list of grids
   returns a the list with each grid shuffled -}
shuffler :: Int -> PureMT -> [[(Float,Float)]] -> [[(Float,Float)]]
shuffler n rng (xs:xss) = shuffle' xs n rng : shuffler n (snd (next rng))         xss
shuffler _ _ [] = []

{- divides list into chunks of size n -}
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = go 
     where go xs = case splitAt n xs of
              (ys,zs) | null ys -> []
                      | otherwise -> ys : go zs

{- dimension of grid, list of random floats [0,1]
   returns a list of (x,y) points of length n^2 such that all
   points are in the range [0,1] and the points are a randomly 
   perturbed regular grid -}
getGridR :: Int -> [Float] -> [(Float,Float)]
getGridR n rs = pts where
   nn = n * n
   (irs,jrs) = splitAt nn rs
   n' = fromIntegral n
   grid = [ (p,q) | p <- [0..n'-1], q <- [0..n'-1] ]
   pts = zipWith (\(p,q) (ir,jr) -> ((p+ir)/n',(q+jr)/n')) grid (zip irs jrs)

{- an infinite list of random floats in range [0,1] -}
randomFloats :: PureMT -> [Float]
randomFloats rng = let (d,rng') = first double2Float (randomDouble rng)
                   in d : randomFloats rng'

The required packages are: , bytestring , binary , random , mersenne-random-pure64 , random-shuffle


Solution

  • Two reasons for the memory usage:

    First, Data.Binary.encode doesn't seem to run in constant space. The following program uses 910 MB memory:

    import Data.Binary
    import qualified Data.ByteString.Lazy as B
    
    len = 10000000 :: Int 
    
    main = B.writeFile "grids.bin" $ encode [0..len]
    

    If we leave a 0 out from len we get 97 MB memory usage.

    In contrast, the following program uses 1 MB:

    import qualified Data.ByteString.Lazy.Char8 as B
    
    main = B.writeFile "grids.bin" $ B.pack $ show [0..(1000000::Int)]
    

    Second, in your program shuffleds contains references to contents of grids, which prevents garbage collection of grids. So when we print grids, we also evaluate it and then it has to sit in memory until we finish printing shuffleds. The following version of your program still consumes lots of memory, but it uses constant space if we comment out one of the two lines with B.writeFile.

    import qualified Data.ByteString.Lazy.Char8 as B
    
    writeGrids :: Int -> Int -> IO ()
    writeGrids num aa = do
        rng <- newPureMT
        let (grids,shuffleds) = createGrids rng aa
        B.writeFile "grids.bin" (B.pack $ show (take num grids))
        B.writeFile "shuffleds.bin" (B.pack $ show (take num shuffleds))