
Running Haskell HXT outside of IO?


All the examples I've seen so far that use the Haskell XML Toolbox (HXT) use runX to execute the parser, and runX runs inside the IO monad. Is there a way to use this XML parser outside of IO? Parsing seems like a pure operation to me, so I don't understand why I'm forced to be inside IO.


Solution

  • You can use HXT's xread along with runLA to parse an XML string outside of IO.

    xread has the following type:

    xread :: ArrowXml a => a String XmlTree
    

    This means you can compose it with any arrow of type (ArrowXml a) => a XmlTree Whatever to get an arrow of type a String Whatever.

    runLA is like runX, but for things of type LA:

    runLA :: LA a b -> a -> [b]
    

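    LA is the pure list arrow (a wrapper around a function a -> [b]) and is an instance of ArrowXml, so xread can be specialised to LA and run with runLA.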
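
    As a minimal sketch of just this combination, before the fuller example: the snippet below assumes hxt 9 or later, where the umbrella module is Text.XML.HXT.Core. The helper name titles and the sample document are made up for illustration.

    ```haskell
    module Main where

    -- Assumption: hxt >= 9. Older releases exposed Text.XML.HXT.Arrow instead.
    import Text.XML.HXT.Core

    -- | Collect the text content of every <title> element in the input.
    --   (Hypothetical helper; only xread, runLA and the standard HXT
    --   combinators come from the library.)
    titles :: String -> [String]
    titles = runLA (xread >>> deep (hasName "title") >>> getChildren >>> getText)

    main :: IO ()
    main = print (titles "<doc><title>A</title><sec><title>B</title></sec></doc>")
    ```

    titles is an ordinary pure function of type String -> [String]; IO only appears in main, to print the result.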

    To put this all together, the following version of my answer to your previous question uses HXT to parse a string containing well-formed XML without any IO involved:

    {-# LANGUAGE Arrows #-}
    module Main where
    
    import qualified Data.Map as M
    import Text.XML.HXT.Core  -- hxt 9 and later; older releases used Text.XML.HXT.Arrow
    
    classes :: (ArrowXml a) => a XmlTree (M.Map String String)
    classes = listA (divs >>> pairs) >>> arr M.fromList
      where
        divs = getChildren >>> hasName "div"
        pairs = proc div -> do
          cls <- getAttrValue "class" -< div
          val <- deep getText         -< div
          returnA -< (cls, val)
    
    getValues :: (ArrowXml a) => [String] -> a XmlTree (String, Maybe String)
    getValues cs = classes >>> arr (zip cs . lookupValues cs) >>> unlistA
      where lookupValues cs m = map (flip M.lookup m) cs
    
    xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
          \<div class='c3'>123</div><div class='c4'>234</div></div>"
    
    values :: [(String, Maybe String)]
    values = runLA (xread >>> getValues ["c1", "c2", "c3", "c4"]) xml
    
    main = print values
    

    classes and getValues are similar to the previous version, with a few minor changes to suit the expected input and output. The main difference is that here we use xread and runLA instead of readString and runX.

    It would be nice to be able to read something like a lazy ByteString in a similar manner, but as far as I know this isn't currently possible with HXT.


    A couple of other things: you can parse strings in this way without IO, but it's probably better to use runX whenever you can: it gives you more control over the configuration of the parser, error messages, etc.
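
    For comparison, here is a rough sketch of the runX route with explicit parser options. It assumes hxt 9 or later; the helper name divClasses and the sample input are hypothetical, and the two options shown are only examples of what can be configured.

    ```haskell
    module Main where

    -- Assumption: hxt >= 9 (Text.XML.HXT.Core).
    import Text.XML.HXT.Core

    -- | Read the class attribute of each <div> directly below the top-level
    --   element. Runs in IO because readString lives in the IO state arrow.
    divClasses :: String -> IO [String]
    divClasses xml = runX $
      readString [withValidate no, withWarnings no] xml
        >>> getChildren                      -- document root -> top-level element
        >>> getChildren >>> hasName "div"    -- its <div> children
        >>> getAttrValue "class"

    main :: IO ()
    main = divClasses "<r><div class='c1'>a</div><div class='c2'>b</div></r>" >>= print
    ```

    Note the extra getChildren: readString wraps the parsed content in a document root node, whereas xread returns the parsed nodes themselves.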

    Also: I tried to make the code in the example straightforward and easy to extend, but the combinators in Control.Arrow and Control.Arrow.ArrowList make it possible to work with arrows much more concisely if you like. The following is an equivalent definition of classes, for example:

    classes = (getChildren >>> hasName "div" >>> pairs) >. M.fromList
      where pairs = getAttrValue "class" &&& deep getText