Search code examples
xmlhaskellhxt

How to wrap a the entire list of matches from HXT with a datatype constructor?


I'm learning HXT at the moment by using it to parse a GPX file. An example is here. I've got the following so far:

import Data.Time
import Text.XML.HXT.Core

data Gpx    = Gpx [Trk]           deriving (Show)
data Trk    = Trk [TrkSeg]        deriving (Show)
data TrkSeg = TrkSeg [TrkPt]      deriving (Show)
data TrkPt  = TrkPt Double Double deriving (Show)

parseGpx =
  getChildren >>> isElem >>> hasName "gpx" >>>
  getChildren >>> isElem >>> hasName "trk" >>>
  parseGpxTrk >>> arr Gpx

parseGpxTrk = undefined
parseGpxTrkSegs = undefined

You can see that it's incomplete, but it should still type-check. Unfortunately, I'm already running into an error:

Couldn't match type ‘Trk’ with ‘[Trk]’
Expected type: Trk -> Gpx
  Actual type: [Trk] -> Gpx
In the first argument of ‘arr’, namely ‘Gpx’
In the second argument of ‘(>>>)’, namely ‘arr Gpx’

What this error says is that I'm trying to pass each matched item from the parseGpxTrk arrow through the arr Gpx constructor, but what I actually want is to pass the entire list of matches through the arr Gpx constructor.

So, how do I get HXT (or arrows in general?) to pass the matches as a list through my arr Gpx constructor instead of passing each entry in the list through the arr Gpx constructor?


Solution

  • Here's a solution that seems pretty good to me

    {-# LANGUAGE Arrows #-}
    
    import Data.Maybe
    import Text.Read
    import Text.XML.HXT.Core
    import Control.Applicative
    
    data Gpx    = Gpx [Trk]           deriving (Show)
    data Trk    = Trk [TrkSeg]        deriving (Show)
    data TrkSeg = TrkSeg [TrkPt]      deriving (Show)
    data TrkPt  = TrkPt Double Double deriving (Show)
    

    The trickiest one is probably the parseTrkPt because in order to do it right you have to handle parsing Strings to Double, which can fail. I've made the decision to have it return a Maybe TrkPt instead, and then handle that further down the line:

    elemsNamed :: ArrowXml cat => String -> cat XmlTree XmlTree
    elemsNamed name = isElem >>> hasName name
    
    parseTrkPt :: ArrowXml cat => cat XmlTree (Maybe TrkPt)
    parseTrkPt = elemsNamed "trkpt" >>>
        proc trkpt -> do
            lat <- getAttrValue "lat" -< trkpt
            lon <- getAttrValue "lon" -< trkpt
            returnA -< TrkPt <$> readMaybe lat <*> readMaybe lon
    

    I've also used the proc syntax here because I think it comes out a lot cleaner. The TrkPt <$> readMaybe lat <*> readMaybe lon has the type Maybe TrkPt and will return Nothing if either of the readMaybes returns Nothing. We can now aggregate all the successful results:

    parseTrkSeg :: (ArrowXml cat, ArrowList cat) => cat XmlTree TrkSeg
    parseTrkSeg =
        elemsNamed "trkseg" >>>
        (getChildren >>> parseTrkPt >>. catMaybes) >. TrkSeg
    

    The parentheses are important here, it took me a while to figure that part out. Depending on where you place the parens you'll get different results, such as [TrkSeg [TrkPt a b], TrkSeg [TrkPt c d]] instead of [TrkSeg [TrkPt a b, TrkPt c d]]. The next to parsers are both straightforward following a similar pattern:

    parseTrk :: ArrowXml cat => cat XmlTree Trk
    parseTrk =
        elemsNamed "trk" >>>
        (getChildren >>> parseTrkSeg) >. Trk
    
    parseGpx :: ArrowXml cat => cat XmlTree Gpx
    parseGpx =
        elemsNamed "gpx" >>>
        (getChildren >>> parseTrk) >. Gpx
    

    Then you can run it quite simply, although you'll have to still drill passed the root element:

    main :: IO ()
    main = do
        gpxs <- runX $ readDocument [withRemoveWS yes] "ana.gpx"
                    >>> getChildren
                    >>> parseGpx
        -- Pretty print the document
        forM_ gpxs $ \(Gpx trks) -> do
            putStrLn "GPX:"
            forM_ trks $ \(Trk segs) -> do
                putStrLn "\tTRK:"
                forM_ segs $ \(TrkSeg pts) -> do
                    putStrLn "\t\tSEG:"
                    forM_ pts $ \pt -> do
                        putStr "\t\t\t"
                        print pt
    

    The trick is to use the methods in the ArrowList typeclass, notably >. which has the type a b c -> ([c] -> d) -> a b d. It aggregates the elements from the ArrowList, passes it to a function that converts it to a new type, then outputs a new ArrowList on that new type d.

    If you want you can even abstract this a bit for the last 3 parsers:

    nestedListParser :: ArrowXml cat => String -> cat XmlTree a -> ([a] -> b) -> cat XmlTree b
    nestedListParser name subparser constructor
        =   elemsNamed name
        >>> (getChildren >>> subparser)
        >.  constructor
    
    parseTrkSeg :: (ArrowXml cat, ArrowList cat) => cat XmlTree TrkSeg
    parseTrkSeg = nestedListParser "trkseg" (parseTrkPt >>. catMaybes) TrkSeg
    
    parseTrk :: ArrowXml cat => cat XmlTree Trk
    parseTrk = nestedListParser "trk" parseTrkSeg Trk
    
    parseGpx :: ArrowXml cat => cat XmlTree Gpx
    parseGpx = nestedListParser "gpx" parseTrk Gpx
    

    This might come in handy if you want to complete the rest of the grammar of a GPX file.