
How do I extract this text from an XML file in Haskell?


I have an XML file in a format similar to this. There are multiple parent tags:

<file>
<parent title = "counting">
<child>
  <notme></notme>
  <me>One <a>Two</a> Three</me>
  <notwanted></notwanted>
  <me>Four Five <a>Six</a></me>
  <notwanted></notwanted>
  <me>Seven Eight <a>Nine</a></me>
</child>
</parent>
</file>

I want to process that XML file and get the following output for each parent tag:

Title - counting
One Two Three
Four Five Six
Seven Eight Nine

I've tried to do this with several XML libraries (hxt, text.xml, ...) without much success. The problem is getting the text inside the "a" tag properly embedded in the surrounding text. I'm just after a small function that will do this. Can anyone help, or suggest the most appropriate library?


Solution

  • This can probably be done with scalpel [Hackage]. We can define a scraper:

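    {-# LANGUAGE OverloadedStrings #-}
    
    import Control.Applicative (liftA2)
    import Text.HTML.Scalpel
      (Scraper, anySelector, atDepth, attrs, chroots, scrapeStringLike, texts)
    
    -- For every <parent>: its title attribute, followed by the text of each <me>.
    mes :: Scraper String [[String]]
    mes = chroots "parent" (liftA2 (<>) (attrs "title" (atDepth anySelector 0)) (texts "me"))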

    For the sample data (with the XML held in a String named t), this gives:

    ghci> scrapeStringLike t mes
    Just [["counting","One Two Three","Four Five Six","Seven Eight Nine"]]

    It thus generates one sublist per parent: the title first, then the text of each me tag.
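
To get from those sublists to the layout asked for in the question, a small formatting step is still needed. The following is a minimal sketch using only base. The names renderParent and renderAll are hypothetical helpers, not part of scalpel, and the sketch assumes every sublist starts with the title, which only holds if each parent tag actually carries a title attribute.

```haskell
-- Hypothetical helper: render one scraped group.
-- Assumes the first element is the title and the rest are the <me> texts.
renderParent :: [String] -> String
renderParent []             = ""
renderParent (title : rest) = unlines (("Title - " ++ title) : rest)

-- Hypothetical helper: render all groups, one after the other.
renderAll :: [[String]] -> String
renderAll = concatMap renderParent
```

With the result shown above, `putStr (renderAll [["counting","One Two Three","Four Five Six","Seven Eight Nine"]])` prints "Title - counting" followed by the three lines of text, one per line.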