groovy html-parsing xmlslurper nodechildren

Groovy XmlSlurper get value out of NodeChildren

I'm parsing HTML and trying to get the full, unparsed content (markup included) of one particular node.

HTML example:

<html>
    <body>
        <div>Hello <br> World <br> !</div>
        <div><object width="420" height="315"></object></div>
    </body>
</html>

Code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0]

However, this returns only the text of the first node, and I get an empty string for the second node. Question: how can I retrieve the value of the first node so that I get:

Hello <br> World <br> !
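For context, the behaviour described above can be reproduced without TagSoup. This is a minimal sketch that assumes a well-formed stand-in for the HTML (with <br/> instead of <br>) and the variable names xhtml, first and second, none of which come from the original post. It shows that text() on a node concatenates text content only, so child markup is dropped and an element with no text yields an empty string.

```groovy
import groovy.xml.*

// Assumption: well-formed stand-in for the question's HTML, so the default parser can read it
def xhtml = '<html><body><div>Hello <br/> World <br/> !</div>' +
            '<div><object width="420" height="315"></object></div></body></html>'

def root = new XmlSlurper().parseText(xhtml)

// text() (which println uses implicitly) returns only the concatenated text content
def first = root.body.div[0].text()
def second = root.body.div[1].text()

assert !first.contains('<br')   // the br elements are not part of the text
assert second == ''             // the object element contains no text at all
```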

Solution

This is what I used to get the content of the first div tag (omitting the XML declaration and namespaces).


@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.*

def html = """<html>
    <body>
        <div>Hello <br> World <br> !</div>
        <div><object width="420" height="315"></object></div>
    </body>
</html>"""

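// TagSoup turns real-world HTML into well-formed SAX events
def parser = new Parser()
// Turn off namespace handling so no namespace declarations end up in the output
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def root = new XmlSlurper(parser).parseText(html)
// bindNode re-serializes the selected node, markup included
println new StreamingMarkupBuilder().bindNode(root.body.div[0]).toString()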

Gives

<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>

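N.B. Unless I'm mistaken, it is TagSoup that adds the closing </br> tags (and the clear='none' attribute), because it normalizes the HTML into well-formed XML. If you literally want Hello <br> World <br> !, you might have to post-process the serialized string (maybe with a regex) or use a different library.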

I know it's including the div element in the output... is this a problem?
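If the wrapping div and the normalized br tags are unwanted, one option is to post-process the serialized string. The following is only a sketch: innerHtml is a hypothetical helper name, and the input is the output shown above pasted in as a literal. Regex-based cleanup like this is fragile and assumes the string starts and ends with a single wrapper element.

```groovy
// Serialized output from the answer above, pasted in as a literal
def serialized = "<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>"

// Hypothetical helper: strips the outer wrapper element and collapses <br ...></br> into <br>
def innerHtml(String xml) {
    def noOpen  = xml.replaceFirst('^<[^>]+>', '')       // drop the opening wrapper tag
    def noClose = noOpen.replaceFirst('</[^>]+>$', '')   // drop the closing wrapper tag
    return noClose.replaceAll('<br[^>]*></br>', '<br>')  // undo the br normalization
}

println innerHtml(serialized)  // Hello <br> World <br> !
```

The helper only targets br; other void elements that TagSoup closes explicitly would need their own replacement.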