Haskell - Reading entire Lazy ByteString


Context: a library I'm using defines a function toXlsx :: ByteString -> Xlsx (where ByteString is the one from Data.ByteString.Lazy).

I've defined several functions that operate on the same file, so I would like to open, read, and convert the file to Xlsx once and keep the result in memory for all of them.

Right now I'm reading the file with bs <- Data.ByteString.Lazy.readFile file and, at the end, doing Data.ByteString.Lazy.length bs `seq` return value.

Is there any way to use this function and keep the file in memory as a whole to reuse it?


Solution

  • Because of the way a lazy bytestring works, the contents of the file won't be read until they are "used", but once they are read, they remain in memory for any subsequent operations, as long as your program still holds a reference to the bytestring. They are only removed from memory when they are garbage collected because your program no longer has any way to access them.

    For example, if you run the following program on a large file:

    import qualified Data.ByteString.Lazy as BL

    main :: IO ()
    main = do
      bigFile <- BL.readFile "ubuntu-14.04-desktop-amd64.iso"
      print $ BL.length $ BL.filter (==0) bigFile     -- takes a while
      print $ BL.length $ BL.filter (==255) bigFile   -- runs fast
    

    the first computation will actually read the entire file into memory and it will be kept there for the second computation.

    I guess this by itself isn't too convincing, since the operating system will also cache the file into memory, and it ends up being hard to tell the difference in timing between Haskell reading the file from the operating system cache for each computation and keeping it in memory across all computations. But, if you ran some heap profiling on this code, you'd discover that the first operation loads up the entire file into "pinned" bytestrings and that allocation stays constant through subsequent operations.

    If your concern is that you want the complete file to be read at the start, even if the first operation doesn't need to read it all, so that there are no subsequent delays as additional parts of the file are read, then your seq-based solution is probably fine. Alternatively, you can read the entire file as a strict bytestring and then convert it using fromStrict. That conversion is O(1) and doesn't copy any data (in contrast to toStrict, which is O(n) and does copy data). So this will work:

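    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Lazy as BL
    import Data.Int (Int64)

    -- Placeholder operations (stand-ins for your own); these just count bytes.
    strictOp :: BS.ByteString -> Int
    strictOp = BS.length

    lazyOp :: BL.ByteString -> Int64
    lazyOp = BL.length

    main :: IO ()
    main = do
      -- read strict: the whole file is in memory once readFile returns
      bigFile <- BS.readFile "whatever.mov"
      -- do strict and lazy operations on the same in-memory data
      print $ strictOp bigFile
      print $ lazyOp (BL.fromStrict bigFile)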
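
To tie this back to the question, here is a minimal sketch of both approaches wrapped up as helpers. The names readFileForced and readFileViaStrict are hypothetical, and countZeros is only a stand-in for a consumer of a lazy ByteString such as toXlsx. evaluate from Control.Exception plays the same role as the seq trick in the question: it forces BL.length, which has to walk every chunk.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Control.Exception (evaluate)
import Data.Int (Int64)
import System.Environment (getArgs)

-- Hypothetical helper: read lazily, then force every chunk before returning.
-- BL.length must traverse all chunks, so evaluating it reads the whole file.
readFileForced :: FilePath -> IO BL.ByteString
readFileForced path = do
  bs <- BL.readFile path
  _ <- evaluate (BL.length bs)
  return bs

-- Hypothetical helper: read strictly, then wrap the result as a lazy
-- bytestring. BL.fromStrict does not copy the data.
readFileViaStrict :: FilePath -> IO BL.ByteString
readFileViaStrict path = BL.fromStrict <$> BS.readFile path

-- Stand-in for a consumer of a lazy ByteString (e.g. toXlsx in the question).
countZeros :: BL.ByteString -> Int64
countZeros = BL.count 0

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      forced    <- readFileForced path
      viaStrict <- readFileViaStrict path
      -- Both values are fully in memory here; reuse them as often as needed.
      print (countZeros forced)
      print (countZeros viaStrict)
    _ -> putStrLn "usage: pass the path of a file to read"
```

Either helper returns a value that can be passed to several functions without the file being read again, as long as the program keeps a reference to it.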