Groovy XmlSlurper get value out of NodeChildren


I'm parsing HTML and trying to get the full, unparsed content (markup included) of one particular node.

HTML example:

<html>
    <body>
        <div>Hello <br> World <br> !</div>
        <div><object width="420" height="315"></object></div>
    </body>
</html>

Code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0]

However, this prints only the text of the first node, and an empty string for the second node. Question: how can I retrieve the content of the first node with its markup intact, so that I get:

Hello <br> World <br> !

Solution

  • This is what I used to get the content of the first div tag (omitting the XML declaration and namespaces).

    @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
    import org.ccil.cowan.tagsoup.Parser
    import groovy.xml.*
    
    def html = """<html>
        <body>
            <div>Hello <br> World <br> !</div>
            <div><object width="420" height="315"></object></div>
        </body>
    </html>"""
    
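    def parser = new Parser()
    // Turn off namespace processing so the serialized output carries no namespaces
    parser.setFeature('http://xml.org/sax/features/namespaces', false)
    def root = new XmlSlurper(parser).parseText(html)
    // Serialize the first div (element plus children) back to markup rather than printing its text
    println new StreamingMarkupBuilder().bindNode(root.body.div[0]).toString()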
    

    Gives

    <div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>
    

    N.B. Unless I'm mistaken, TagSoup is adding the closing tags. If you literally want Hello <br> World <br> !, you might have to use a different library (or perhaps a regex).

    Note that the output also includes the enclosing div element. Is this a problem?
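
If the enclosing div and the added closing tags are a problem, one option is to post-process the serialized string with a regex, as hinted above. The following is only a rough sketch: `innerHtml` is a hypothetical helper name, and it assumes the input is exactly the kind of single-root string printed above, with each `<br ...>` immediately followed by `</br>`. Regexes are fragile on arbitrary markup, so treat this as a workaround rather than a general solution.

```groovy
// Serialized output as printed by the StreamingMarkupBuilder example above
def serialized = "<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>"

// Hypothetical helper (not part of Groovy or TagSoup):
// 1. drops the outermost opening and closing tag,
// 2. collapses every <br ...></br> pair into a bare <br>.
def innerHtml(String xml) {
    // (?s) lets '.' match newlines; the greedy group keeps everything up to the last closing tag
    def inner = xml.replaceFirst(/(?s)^<[^>]+>(.*)<\/[^>]+>$/, '$1')
    inner.replaceAll(/<br\b[^>]*>\s*<\/br>/, '<br>')
}

println innerHtml(serialized)
```

For the sample document this should print Hello <br> World <br> !. Attributes on other elements (such as the object tag in the second div) are left untouched, since only br pairs are rewritten.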