I'm trying to understand how to use the iteratee library with Haskell. All of the articles I've seen so far seem to focus on building an intuition for how iteratees could be built, which is helpful, but now that I want to get down and actually use them, I feel a bit at sea. Looking at the source code for iteratees has been of limited value for me.
Let's say I have this function which trims trailing whitespace from a line:
import Data.ByteString.Char8
import Data.Char (isSpace)
rstrip :: ByteString -> ByteString
rstrip = fst . spanEnd isSpace
What I'd like to do is: make this into an iteratee, read a file and write it out somewhere else with the trailing whitespace stripped from each line. How would I go about structuring that with iteratees? I see there's an enumLinesBS function in Data.Iteratee.Char which I could plumb into this, but I don't know if I should use mapChunks or convStream or how to repackage the function above into an iteratee.
If you just want code, it's this:
procFile' iFile oFile = fileDriver (joinI $
    enumLinesBS ><>
    mapChunks (map rstrip) $
    I.mapM_ (B.appendFile oFile))
    iFile
Commentary:
This is a three-stage process: first you transform the raw stream into a stream of lines, then you apply your function to convert that stream of lines, and finally you consume the stream. Since rstrip belongs to the middle stage, it becomes a stream transformer (an Enumeratee).
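To make the three stages concrete before bringing in iteratees, here is a minimal sketch of the same shape on an in-memory ByteString. It uses only the bytestring package, not the iteratee API. The names toLines, stripAll, consume and pipeline are hypothetical and chosen for illustration, and the consumer here simply re-joins the lines with B.unlines rather than writing to a file.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)

-- The function from the question, with qualified names.
rstrip :: B.ByteString -> B.ByteString
rstrip = fst . B.spanEnd isSpace

-- Stage 1: raw bytes -> lines (the role enumLinesBS plays in the answer).
toLines :: B.ByteString -> [B.ByteString]
toLines = B.lines

-- Stage 2: lines -> stripped lines (the role of the stream transformer).
stripAll :: [B.ByteString] -> [B.ByteString]
stripAll = map rstrip

-- Stage 3: consume the lines (the role of the final iteratee).
-- Illustrative assumption: re-join with newlines instead of writing a file.
consume :: [B.ByteString] -> B.ByteString
consume = B.unlines

-- The whole pipeline, composed in stage order from right to left.
pipeline :: B.ByteString -> B.ByteString
pipeline = consume . stripAll . toLines
```

For example, pipeline (B.pack "a  \nb\t\n") yields the bytes "a\nb\n". The iteratee version has the same three roles, but processes the file incrementally instead of holding it all in memory.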
You can use either mapChunks or convStream, but mapChunks is simpler. The difference is that mapChunks doesn't allow you to cross chunk boundaries, whereas convStream is more general. I prefer convStream because it doesn't expose any of the underlying implementation, but if mapChunks is sufficient the resulting code is usually shorter.
rstripE :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE = mapChunks (map rstrip)
Note the extra map in rstripE. The outer stream (which is the input to rstripE) has type [ByteString], so we need to map rstrip onto it.
For comparison, this is what it would look like if implemented with convStream:
rstripE' :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE' = convStream $ do
    mLine <- I.peek
    -- the iteratee given to convStream returns a chunk of the output stream, here a list
    maybe (return []) (\line -> I.drop 1 >> return [rstrip line]) mLine
This is longer, and it's less efficient because it will only apply the rstrip function to one line at a time, even though more lines may be available. It's possible to work on the whole currently available chunk, which is closer to the mapChunks version:
rstripE'2 :: Monad m => Enumeratee [ByteString] [ByteString] m a
rstripE'2 = convStream (liftM (map rstrip) getChunk)
Anyway, with the stripping enumeratee available, it's easily composed with the enumLinesBS enumeratee:
enumStripLines :: Monad m => Enumeratee ByteString [ByteString] m a
enumStripLines = enumLinesBS ><> rstripE
The composition operator ><> follows the same order as the arrow operator >>>. enumLinesBS splits the stream into lines, then rstripE strips them. Now you just need to add a consumer (which is a normal iteratee), and you're done:
writer :: FilePath -> Iteratee [ByteString] IO ()
writer fp = I.mapM_ (B.appendFile fp)
processFile iFile oFile =
    enumFile defaultBufSize iFile (joinI $ enumStripLines $ writer oFile) >>= run
The fileDriver functions are shortcuts for simply enumerating over a file and running the resulting iteratee (unfortunately the argument order is switched from enumFile):
procFile2 iFile oFile = fileDriver (joinI $ enumStripLines $ writer oFile) iFile
Addendum: here's a situation where you would need the extra power of convStream. Suppose you want to concatenate every 2 lines into one. You can't use mapChunks. Consider the case where the chunk is a singleton list, [bytestring]. mapChunks doesn't provide any way to access the next chunk, so there's nothing to concatenate it with. With convStream, however, it's simple:
concatPairs = convStream $ do
    line1 <- I.head
    line2 <- I.head
    -- wrap the result in a list: convStream expects a chunk of the output stream
    return [line1 `B.append` line2]
This can also be written in applicative style:
convStream $ (:[]) <$> (B.append <$> I.head <*> I.head)
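The chunk-boundary problem can be seen without the library at all. The following is a small pure model, not the iteratee API: a stream is represented as a list of chunks, and the names Chunks, pairUp, perChunk and acrossChunks are hypothetical. In this model a leftover odd line is passed through unchanged.

```haskell
-- A stream modelled as a list of chunks; each chunk is a list of lines.
type Chunks = [[String]]

-- Concatenate lines two at a time; a leftover odd line is emitted alone.
pairUp :: [String] -> [String]
pairUp (a : b : rest) = (a ++ b) : pairUp rest
pairUp leftover       = leftover

-- Per-chunk processing: each chunk is transformed in isolation,
-- which is the restriction a mapChunks-style transformer lives under.
perChunk :: Chunks -> [String]
perChunk = concatMap pairUp

-- Stream-level processing: chunk boundaries are invisible,
-- which is the view a convStream-style transformer gets.
acrossChunks :: Chunks -> [String]
acrossChunks = pairUp . concat
```

With the input [["a"], ["b", "c"], ["d"]], perChunk gives ["a", "bc", "d"], while acrossChunks gives ["ab", "cd"]. The per-chunk result depends on how the stream happened to be split, which is why pairing lines needs the more general combinator.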
You can think of convStream as continually consuming a portion of the stream with the provided iteratee, then sending the transformed version to the inner consumer. Sometimes even this isn't general enough, since the same iteratee is called at each step. In that case, you can use unfoldConvStream to pass state between successive iterations.
convStream and unfoldConvStream also allow for monadic actions, since the stream processing iteratee is a monad transformer.
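As a rough sketch of the unfoldConvStream case, here is an enumeratee that prefixes each line with a running line number, which needs state carried from one step to the next. This assumes the iteratee 0.8 series, where unfoldConvStream takes a function from the current state to an iteratee returning the new state paired with an output chunk, plus an initial state. The names numberLines and step are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Iteratee (Enumeratee, unfoldConvStream)
import qualified Data.Iteratee as I

-- Prefix every line with its 1-based line number.
-- Assumed signature (iteratee 0.8.x):
--   unfoldConvStream :: (Monad m, Nullable s)
--                    => (acc -> Iteratee s m (acc, s')) -> acc -> Enumeratee s s' m a
numberLines :: Monad m => Enumeratee [B.ByteString] [B.ByteString] m a
numberLines = unfoldConvStream step (1 :: Int)
  where
    -- Consume one line, emit it with the current counter, and pass on counter + 1.
    step n = do
      line <- I.head
      return (n + 1, [B.pack (show n ++ ": ") `B.append` line])
```

If the assumption about the signature holds, this composes like the other enumeratees in this answer, for example enumLinesBS ><> rstripE ><> numberLines.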