I need to get the text content of the first <p> that is a child of <div class="about">, so I wrote the following code:
tagTextS :: IOSArrow XmlTree String
tagTextS = getChildren >>> getText >>> arr stripString
parseDescription :: IOSArrow XmlTree String
parseDescription =
    (
        deep (isElem >>> hasName "div" >>> hasAttrValue "id" (== "company_about_full_description"))
        >>> (arr (\x -> x) /> isElem >>> hasName "p") >. (!! 0) >>> tagTextS
    ) `orElse` (constA "")
Note the arr (\x -> x): without it I wasn't able to get a result.
Is there a better way to write parseDescription, particularly around the arr and after hasName "p"? (I actually found this solution here.)

Another proposal, using only the HXT core as you asked.
The first child cannot be selected from the output of getChildren, because HXT's (>>>) applies the next arrow to each item of the preceding arrow's output list rather than to the list as a whole. This is explained on the HaskellWiki HXT page; the definition given there is old, and the composition now derives from Category's (.).
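To see why, here is a toy model of that composition using only base. It is a sketch, not HXT's implementation: ListArrow, andThen, collect, children, name and the Node type are names made up for this illustration, with andThen playing the role of (>>>) and collect the role of (>.).

```haskell
-- Toy list arrow: one input, a list of outputs (hypothetical, not HXT's type).
newtype ListArrow a b = ListArrow { runListArrow :: a -> [b] }

-- In the style of (>>>): the second arrow runs once per result of the first.
andThen :: ListArrow a b -> ListArrow b c -> ListArrow a c
andThen (ListArrow f) (ListArrow g) = ListArrow (concatMap g . f)

-- In the style of (>.): the function receives the whole result list.
collect :: ListArrow a b -> ([b] -> c) -> ListArrow a c
collect (ListArrow f) h = ListArrow (\x -> [h (f x)])

-- A minimal tree node: a tag name and its children.
data Node = Node String [Node]

children :: ListArrow Node Node
children = ListArrow (\(Node _ cs) -> cs)

name :: ListArrow Node String
name = ListArrow (\(Node n _) -> [n])

doc :: Node
doc = Node "div" [Node "p" [], Node "span" [], Node "p" []]

main :: IO ()
main = do
  -- name is applied to each child separately; it never sees the list.
  print (runListArrow (children `andThen` name) doc)
  -- Only a list-level combinator can keep just the first result.
  print (runListArrow (collect (children `andThen` name) (take 1)) doc)
```

This is why the question's code needs (>.) with (!! 0), and why the answer below instead trims the child list before it is turned into arrow results.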
getNthChild can be built on the model of getChildren from Control.Arrow.ArrowTree:
import Control.Arrow.ArrowList (ArrowList (..))
import Data.Tree.Class (Tree)
import qualified Data.Tree.Class as T
-- Select the nth child (0-based); if it does not exist, the result is empty.
getNthChild :: (ArrowList a, Tree t) => Int -> a (t b) (t b)
getNthChild n = arrL (take 1 . drop n . T.getChildren)
Then your parseDescription could take this form:
-- importing Text.XML.HXT.Arrow.XmlArrow (hasName, hasAttrValue)
parseDescription =
    deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about")
          >>> getNthChild 0 >>> hasName "p"
         )
    >>> getChildren >>> getText
Update: I found another way, using changeChildren:
getNthChild :: (ArrowTree a, Tree t) => Int -> a (t b) (t b)
getNthChild n = changeChildren (take 1 . drop n) >>> getChildren
Update: to avoid picking up the whitespace text nodes between elements, filter out the non-element children:
import qualified Text.XML.HXT.DOM.XmlNode as XN
getNthChild :: (ArrowTree a, Tree t, XN.XmlNode b) => Int -> a (t b) (t b)
getNthChild n = changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren
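The difference between the last two versions shows up on indented markup. The following is a minimal sketch, assuming the hxt package is installed; getNthElemChild, firstParagraph and sample are names introduced here for illustration, the signatures are narrowed to XmlTree for brevity, and the pure runLA with xread stands in for the IOSArrow pipeline of the question.

```haskell
import Text.XML.HXT.Core
import qualified Text.XML.HXT.DOM.XmlNode as XN

-- nth child of any kind, text nodes included (the changeChildren variant).
getNthChild :: ArrowXml a => Int -> a XmlTree XmlTree
getNthChild n = changeChildren (take 1 . drop n) >>> getChildren

-- nth element child, ignoring text nodes (hypothetical name for this sketch).
getNthElemChild :: ArrowXml a => Int -> a XmlTree XmlTree
getNthElemChild n =
  changeChildren (take 1 . drop n . filter XN.isElem) >>> getChildren

-- The answer's parseDescription, parameterised over the child selector.
firstParagraph :: (Int -> LA XmlTree XmlTree) -> LA XmlTree String
firstParagraph nth =
  deep (isElem >>> hasName "div" >>> hasAttrValue "class" (== "about")
        >>> nth 0 >>> hasName "p")
  >>> getChildren >>> getText

-- Indented input: the div's first child is a whitespace text node.
sample :: String
sample = "<div class=\"about\">\n  <p>Hello</p>\n  <p>World</p>\n</div>"

main :: IO ()
main = do
  -- The whitespace node is selected first, so hasName "p" rejects it.
  print (runLA (xread >>> firstParagraph getNthChild) sample)
  -- With the element filter, the first <p> is found.
  print (runLA (xread >>> firstParagraph getNthElemChild) sample)
```

If the document is read with whitespace removal enabled, the unfiltered version may be enough; the filter makes the selector independent of that setting.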