I am learning Haskell lazy IO.
I am looking for an elegant way to copy a large file (8 GB) while printing copy progress to the console.
Consider the following simple program that copies a file silently.
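module Main where

-- getArgs lives in System.Environment; the old Haskell 98 System module
-- is not available in current GHC installations
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          body <- B.readFile from
          B.writeFile to body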
Imagine there is a callback function you want to use for reporting:
onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)
QUESTION: How can I weave the onReadBytes function into a lazy ByteString so that it is called back on each successful read? Or, if this design is not good, what is the Haskell way to do it?
NOTE: The frequency of the callback is not important; it can be called every 1024 bytes or every 1 MB.
ANSWER: Many thanks to camccann for the answer. I suggest reading it in full.
Below is my version of the code, based on camccann's code; you may find it useful.
module Main where

import Control.Monad (unless)
import System.Environment (getArgs)
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          withFile from ReadMode $ \fromH ->
            withFile to WriteMode $ \toH ->
              copyH fromH toH $ \x -> putStrLn $ "Bytes copied: " ++ show x

-- copy between two handles in 256 KiB chunks, reporting the running total
copyH :: Handle -> Handle -> (Integer -> IO ()) -> IO ()
copyH fromH toH onProgress =
    copy (B.hGet fromH (256 * 1024)) (write toH) B.null onProgress
  where write o x = do B.hPut o x
                       return . fromIntegral $ B.length x

copy :: (Monad m) => m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m ()) -> m ()
copy = copy_ 0

-- the first argument is the number of bytes copied so far
copy_ :: (Monad m) => Integer -> m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m ()) -> m ()
copy_ count inp outp done onProgress = do x <- inp
                                          unless (done x) $
                                            do n <- outp x
                                               onProgress (n + count)
                                               copy_ (n + count) inp outp done onProgress
First, I'd like to note that a fair number of Haskell programmers regard lazy IO in general with some suspicion. It technically violates purity, but in a limited way that (as far as I'm aware) isn't noticeable when running a single program on consistent input[0]. On the other hand, plenty of people are fine with it, again because it involves only a very restricted kind of impurity.
To create the illusion of a lazy data structure that's actually created with on-demand I/O, functions like readFile are implemented using sneaky shenanigans behind the scenes. Weaving in the on-demand I/O is inherent to the function, and it's not really extensible for pretty much the same reasons that the illusion of getting a regular ByteString from it is convincing.
Handwaving the details and writing pseudocode, something like readFile basically works like this:
lazyInput inp = lazyIO (lazyInput' inp)

lazyInput' inp = do x <- readFrom inp
                    if (endOfInput inp)
                      then return []
                      else do xs <- lazyInput inp
                              return (x:xs)
...where each time lazyIO is called, it defers the I/O until the value is actually used. To invoke your reporting function each time the actual read occurs, you'd need to weave it in directly, and while a generalized version of such a function could be written, to my knowledge none exists.
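To make the pseudocode above concrete, here is a minimal sketch of a lazy reader with a reporting hook woven in. It assumes unsafeInterleaveIO plays the role of lazyIO; the name hGetContentsWith and its chunk-size parameter are hypothetical (they are not part of the bytestring library), and the real library implementation may differ in its details.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.Environment (getArgs)
import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical helper, not a bytestring API: lazily read a handle in chunks
-- of the given size, calling the hook with the running total each time a
-- chunk is actually read from the handle.
hGetContentsWith :: Int -> (Integer -> IO ()) -> Handle -> IO L.ByteString
hGetContentsWith chunkSize onRead h = L.fromChunks <$> go 0
  where
    -- unsafeInterleaveIO defers each read until its list cell is demanded
    go total = unsafeInterleaveIO $ do
      chunk <- S.hGet h chunkSize
      if S.null chunk
        then hClose h >> return []          -- end of input: close the handle
        else do
          let total' = total + fromIntegral (S.length chunk)
          onRead total'
          rest <- go total'                 -- also deferred, so this returns at once
          return (chunk : rest)

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ show count

main :: IO ()
main = do
  [from, to] <- getArgs
  inH <- openBinaryFile from ReadMode
  body <- hGetContentsWith (256 * 1024) onReadBytes inH
  -- the reads, and therefore the reports, happen while writeFile consumes body
  L.writeFile to body
```

Note that the handle is only closed once the input has been consumed to the end, which is one of the reasons this approach feels like a hack.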
Given the above, you have a few options:
Look up the implementation of the lazy I/O functions you're using, and implement your own that include the progress reporting function. If this feels like a dirty hack, that's because it pretty much is, but there you go.
Abandon lazy I/O and switch to something more explicit and composable. This is the direction that the Haskell community as a whole seems to be heading in, specifically toward some variation on Iteratees, which give you nicely composable little stream processor building blocks that have more predictable behavior. The downside is that the concept is still under active development so there's no consensus on implementation or single starting point for learning to use them.
Abandon lazy I/O and switch to plain old regular I/O: Write an IO action that reads a chunk, prints the reporting info, and processes as much input as it can; then invoke it in a loop until done. Depending on what you're doing with the input and how much you're relying on laziness in your processing, this could involve anything from writing a couple nearly-trivial functions to building a bunch of finite-state-machine stream processors and getting 90% of the way to reinventing Iteratees.
[0]: The underlying function here is called unsafeInterleaveIO, and to the best of my knowledge the only ways to observe impurity from it require either running the program on different input (in which case it's entitled to behave differently anyhow, it just may be doing so in ways that don't make sense in pure code), or changing the code in certain ways (i.e., refactorings that should have no effect can have non-local effects).
Here's a rough example of doing things the "plain old regular I/O" way, using more composable functions:
import System.Environment (getArgs)
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          -- withFile closes the handle for us after the action completes
          withFile from ReadMode $ \inH ->
            withFile to WriteMode $ \outH ->
              -- run the loop with the appropriate actions
              runloop (B.hGet inH 128) (processBytes outH) B.null

-- note the very generic type; this is useful, because it proves that the
-- runloop function can only execute what it's given, not do anything else
-- behind our backs.
runloop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
runloop inp outp done = do x <- inp
                           if done x
                             then return ()
                             else do outp x
                                     runloop inp outp done

-- write the output and report progress to stdout. note that this can be easily
-- modified, or composed with other output functions.
processBytes :: Handle -> B.ByteString -> IO ()
processBytes h bs | B.null bs = return ()
                  | otherwise = do onReadBytes (fromIntegral $ B.length bs)
                                   B.hPut h bs

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)
The "128" up there is how many bytes to read at a time. Running this on a random source file in my "Stack Overflow snippets" directory:
$ runhaskell ReadBStr.hs Corec.hs temp
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 83
$
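The run above reports the size of each chunk rather than a running total. As a rough sketch of how the output action could be composed with other behavior, the variant below leaves runloop unchanged and wraps the writer in a counter kept in an IORef. The names withTotal and copyWithProgress, and the explicit chunk-size argument, are assumptions made for illustration; they do not appear in the code above.

```haskell
import Control.Monad (unless)
import qualified Data.ByteString.Lazy as B
import Data.IORef
import System.Environment (getArgs)
import System.IO

-- the same generic loop as above
runloop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
runloop inp outp done = do
  x <- inp
  unless (done x) $ do
    outp x
    runloop inp outp done

-- Hypothetical combinator: run the wrapped output action, add the chunk's
-- size to the counter, then report the running total.
withTotal :: IORef Integer -> (Integer -> IO ())
          -> (B.ByteString -> IO ()) -> B.ByteString -> IO ()
withTotal counter report outp bs = do
  outp bs
  modifyIORef' counter (+ fromIntegral (B.length bs))
  readIORef counter >>= report

-- Hypothetical wrapper: copy a file in chunks of the given size, calling
-- report with the total number of bytes copied after each chunk.
copyWithProgress :: Int -> FilePath -> FilePath -> (Integer -> IO ()) -> IO ()
copyWithProgress chunkSize from to report = do
  counter <- newIORef 0
  withBinaryFile from ReadMode $ \inH ->
    withBinaryFile to WriteMode $ \outH ->
      runloop (B.hGet inH chunkSize)
              (withTotal counter report (B.hPut outH))
              B.null

main :: IO ()
main = do
  [from, to] <- getArgs
  copyWithProgress (256 * 1024) from to $ \n ->
    putStrLn $ "Bytes copied: " ++ show n
```

Because the counter lives in the wrapped output action, the loop itself stays as generic as before; threading the total through the loop as an extra argument would work equally well.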