HXT getting first element: refactor weird arrow


I need to get the text content of the first <p> that is a child of <div class="about">. I wrote the following code:

tagTextS :: IOSArrow XmlTree String
tagTextS = getChildren >>> getText >>> arr stripString

parseDescription :: IOSArrow XmlTree String
parseDescription =
  (
   deep (isElem >>> hasName "div" >>> hasAttrValue "id" (== "company_about_full_description"))
   >>> (arr (\x -> x) /> isElem  >>> hasName "p") >. (!! 0) >>> tagTextS
  ) `orElse` (constA "")

Look at this arr (\x -> x): without it I was not able to get a result.

  • Is there a better way to write parseDescription?
  • Also, why do I need the parentheses before arr and after hasName "p"? (I actually found this solution here)

Solution

  • Here is another proposal that uses only HXT core, as you requested.

    Selecting the first child cannot be done by composing another arrow onto the output of getChildren with (>>>). In HXT arrows, (>>>) applies the subsequent arrow to each item of the preceding arrow's output, not to the output list as a whole. This is explained on the HaskellWiki HXT page, although the definition shown there is an old one; (>>>) is now derived from Category's (.) composition.
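
The per-item behaviour described above can be sketched without HXT at all. The following is a toy model, not HXT's implementation: Node, ListArrow, andThen, onResults and children are hypothetical names made up for this illustration. andThen plays the role of (>>>), and onResults plays the role of a list-level combinator such as HXT's (>>.).

```haskell
-- Toy document tree (hypothetical, stands in for XmlTree).
data Node = Elem String [Node] | Text String
  deriving (Eq, Show)

-- Toy list arrow: a function that returns zero or more results.
newtype ListArrow a b = ListArrow { runListArrow :: a -> [b] }

-- Stand-in for (>>>): the second arrow runs once per result of the first.
andThen :: ListArrow a b -> ListArrow b c -> ListArrow a c
andThen (ListArrow f) (ListArrow g) = ListArrow (concatMap g . f)

-- Stand-in for a list-level combinator: the function sees the whole result list.
onResults :: ListArrow a b -> ([b] -> [c]) -> ListArrow a c
onResults (ListArrow f) h = ListArrow (h . f)

-- All direct children of a node.
children :: ListArrow Node Node
children = ListArrow go
  where
    go (Elem _ cs) = cs
    go (Text _)    = []

-- "take 1" applied per item: every child is its own one-element list,
-- so nothing is dropped.
perItem :: ListArrow Node Node
perItem = children `andThen` ListArrow (\x -> take 1 [x])

-- "take 1" applied to the whole result list: only the first child survives.
wholeList :: ListArrow Node Node
wholeList = children `onResults` take 1

sample :: Node
sample = Elem "div" [Elem "p" [Text "first"], Elem "p" [Text "second"]]

main :: IO ()
main = do
  print (length (runListArrow perItem sample))    -- 2
  print (length (runListArrow wholeList sample))  -- 1
```

This is why the selection has to happen either inside the arrow that produces the children or through a combinator that receives the full list.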

    getNthChild can be adapted from getChildren in Control.Arrow.ArrowTree:

    import Control.Arrow.ArrowList (ArrowList, arrL)
    import Data.Tree.Class (Tree)
    import qualified Data.Tree.Class as T
    
    -- if the nth child does not exist, the arrow yields no results
    
    getNthChild :: (ArrowList a, Tree t) => Int -> a (t b) (t b)
    getNthChild n = arrL (take 1 . drop n . T.getChildren)
    

    then your parseDescription could take this form:

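    -- needs isElem, hasName, hasAttrValue and getText from Text.XML.HXT.Arrow.XmlArrow,
    -- and deep and getChildren from Control.Arrow.ArrowTree
    
    parseDescription :: IOSArrow XmlTree String
    parseDescription =
        deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about")
              >>> getNthChild 0 >>> hasName "p"
             )
        >>> getChildren >>> getText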
    

    Update: I found another way, using changeChildren:

    getNthChild :: (ArrowTree a, Tree t) => Int -> a (t b) (t b)
    getNthChild n = changeChildren (take 1 . drop n) >>> getChildren
    

    Update: to skip the whitespace text nodes that sit between elements, filter out the non-element children before selecting:

    import qualified Text.XML.HXT.DOM.XmlNode as XN
    
    getNthChild :: (ArrowTree a, Tree t, XN.XmlNode b) => Int -> a (t b) (t b)
    getNthChild n = changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren
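
To try the last version end to end, something like the following sketch may help. It assumes the hxt package is installed, specialises getNthChild to XmlTree for brevity, and uses the pure LA arrow with xread instead of IOSArrow so that no file or IO state is needed. The sample markup is made up for the illustration.

```haskell
import Text.XML.HXT.Core
import qualified Text.XML.HXT.DOM.XmlNode as XN

-- Same idea as the last update, specialised to XmlTree:
-- keep only element children, then pick the nth one (if any).
getNthChild :: ArrowTree a => Int -> a XmlTree XmlTree
getNthChild n =
    changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren

-- Text of the first element child of <div class="about">, if it is a <p>.
parseDescription :: ArrowXml a => a XmlTree String
parseDescription =
    deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about")
          >>> getNthChild 0 >>> hasName "p")
    >>> getChildren >>> getText

main :: IO ()
main = do
    -- Made-up sample input with whitespace text nodes between the elements.
    let html = "<div class=\"about\">\n  <p>first</p>\n  <p>second</p>\n</div>"
    -- runLA returns every result of the arrow as a list.
    print (runLA (xread >>> parseDescription) html)
```

Because the arrow yields no results when the div or its first element child is missing, the original `orElse` (constA "") fallback can still be wrapped around parseDescription if an empty string is preferred.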