Search code examples
haskellfunctional-programmingbinary-treemonadsmonad-transformers

How do you save a tree data structure to binary file in Haskell


I'm trying to save a simple (but quite big) Tree structure into a binary file using Haskell. The structure looks something like this:

-- For simplicity assume each Node has only 4 childs
data Tree = Node [Tree] | Leaf [Int]
And here is how I need the data look on disk:

  1. Each node starts with four 32-bit offsets to it's children, then follow the childs.
  2. I don't care much about the leafs, let's say it's just n consecutive 32-bit numbers.
  3. For practival purposes I would need some node labels or some other additional data but right now I don't care about that much neither.

It apears to me that Haskellers first choice when writing binary files is the Data.Binary.Put library. But with that I have a problem in the bullet #1. In particular, when I'm about to write a Node to a file, to write down the child offsets I need to know my current offset and the size of each child.

This is not something that Data.Binary.Put provides so I thought this must be a perfect application of Monad transformers. But even though it sounds cool and functional, so far I have not been successfull with this approach.

I asked two other questions that I thought would help me solve the problem here and here. I must say that each time I received very nice answers that helped me progress further but unfortunatelly I am still unable to solve the problem as a whole.

Here is what I've got so far, it still leaks too much memory to be practical.

I would love to have solution that uses such functional approach, but would be grateful for any other solution as well.


Solution

  • Here is implementation of two pass solution proposed by sclv.

    import qualified Data.ByteString.Lazy as L
    import Data.Binary.Put
    import Data.Word
    import Data.List (foldl')
    
    data Tree = Node [Tree] | Leaf [Word32] deriving Show
    
    makeTree 0 = Leaf $ replicate 100 0xdeadbeef
    makeTree n = Node $ replicate 4 $ makeTree $ n-1
    

    SizeTree mimics original Tree, it does not contain data but at each node it stores size of corresponding child in Tree.
    We need to have SizeTree in memory, so it worth to make it more compact (e.g. replace Ints with uboxed words).

    data SizeTree
      = SNode {sz :: Int, chld :: [SizeTree]}
      | SLeaf {sz :: Int}
      deriving Show
    

    With SizeTree in memory it is possible to serialize original Tree in streaming fashion.

    putTree :: Tree -> SizeTree -> Put
    putTree (Node xs) (SNode _ ys) = do
      putWord8 $ fromIntegral $ length xs          -- number of children
      mapM_ (putWord32be . fromIntegral . sz) ys   -- sizes of children
      sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
    putTree (Leaf xs) _ = do
      putWord8 0                                   -- zero means 'leaf'
      putWord32be $ fromIntegral $ length xs       -- data length
      mapM_ putWord32be xs                         -- leaf data
    
    
    mkSizeTree :: Tree -> SizeTree
    mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
    mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
      where
        ys = map mkSizeTree xs
        sum' = foldl' (+) 0
    

    It is important to prevent GHC from merging two passes into one (in which case it will hold tree in memory). Here it is done by feeding not tree but tree generator to the function.

    serialize mkTree size = runPut $ putTree (mkTree size) treeSize
      where
        treeSize = mkSizeTree $ mkTree size
    
    main = L.writeFile "dump.bin" $ serialize makeTree 10