All the examples I've seen so far that use the Haskell XML toolkit, HXT, use runX to execute the parser. runX runs inside the IO monad. Is there a way of using this XML parser outside of IO? Parsing seems like a pure operation to me, so I don't understand why I'm forced to be inside IO.
You can use HXT's xread along with runLA to parse an XML string outside of IO.
xread has the following type:
xread :: ArrowXml a => a String XmlTree
This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an a String Whatever.
runLA is like runX, but for things of type LA:
runLA :: LA a b -> a -> [b]
LA is an instance of ArrowXml.
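As a minimal sketch of how these two pieces fit together, the following composes xread with an ordinary ArrowXml arrow and runs the result with runLA. The helper name texts is made up for illustration, and the import assumes HXT 9 or later, where the umbrella module is Text.XML.HXT.Core.

```haskell
module Main where

-- Assumption: HXT 9+; older releases exposed Text.XML.HXT.Arrow instead.
import Text.XML.HXT.Core

-- Hypothetical helper: collect the text nodes of an XML fragment.
-- No IO is involved, so this is an ordinary pure function.
texts :: String -> [String]
texts = runLA (xread >>> deep getText)

main :: IO ()
main = print (texts "<p>Hello <b>world</b></p>")
```

Because runLA returns a plain list, texts can be called from any pure code; main is only there to print the result.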
To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:
{-# LANGUAGE Arrows #-}
module Main where

import qualified Data.Map as M
-- In HXT 9 and later this module is named Text.XML.HXT.Core.
import Text.XML.HXT.Arrow

classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> pairs) >>> arr M.fromList
  where
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
      cls <- getAttrValue "class" -< div
      val <- deep getText         -< div
      returnA -< (cls, val)

getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
  where lookupValues cs m = map (flip M.lookup m) cs

xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
      \<div class='c3'>123</div><div class='c4'>234</div></div>"

values :: [(String, Maybe String)]
values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml

main = print values
classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.
It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.
A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can, since it gives you more control over the parser's configuration, error messages, etc.
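For comparison, here is a rough sketch of the IO-based route, showing where that configuration goes. It assumes the HXT 9+ API, in which readString takes a list of options such as withValidate and withWarnings; parsePairs is a hypothetical name, not something from the code above.

```haskell
module Main where

-- Assumption: HXT 9+, where readString accepts a list of config options.
import Text.XML.HXT.Core

-- Hypothetical helper: pull (class, text) pairs out of every div that
-- has a class attribute, with the parser configured explicitly.
parsePairs :: String -> IO [(String, String)]
parsePairs src =
  runX $ readString [withValidate no, withWarnings yes] src
         >>> deep (hasName "div" >>> hasAttr "class")
         >>> (getAttrValue "class" &&& deep getText)

main :: IO ()
main = parsePairs "<div><div class='c1'>a</div></div>" >>= print
```

Unlike xread, readString produces a full document tree with a root node, which is why the sketch searches with deep rather than stepping through getChildren.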
Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:
classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
  where pairs = getAttrValue "class" &&& deep getText