Tags: groovy, html-parsing, xmlslurper

Groovy XmlParser / XmlSlurper: node.localText() position?


This is a follow-up to the question "Groovy XmlSlurper get value of the node without children".

It explains that, to get the local inner text of an (HTML) node without recursively collecting the nested text of its child nodes as well, one has to use #localText() instead of #text().
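To make the difference concrete, here is a minimal sketch. It assumes Groovy 3 or newer (where XmlSlurper lives in groovy.xml) and uses a compact, well-formed fragment of my own rather than the TagSoup-parsed HTML from this question, so that whitespace does not obscure the result.

```groovy
// Assumes Groovy 3 or newer; on Groovy 2.x, drop the import and
// rely on the default-imported groovy.util.XmlSlurper instead.
import groovy.xml.XmlSlurper

// Illustrative input only: no indentation, so there are no whitespace-only text nodes.
def div = new XmlSlurper().parseText('<div>one<a href="http://example.com">link</a>two</div>')

// text() descends into child elements and concatenates everything it finds.
assert div.text() == 'onelinktwo'

// localText() returns only the text nodes that are direct children of <div>.
assert div.localText() == ['one', 'two']
```

Note that the list returned by localText() has no entry for the `<a>` element, which is exactly why the position of each text fragment cannot be read off it.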

For instance, a slightly enhanced example from the original question:

<html>
    <body>
        <div>
            Text I would like to get1.
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </div>
        <span>
            <a href="http://intro.com">extra stuff</a>
            Text I would like to get2.
            <a href="http://example.com">link to example</a>
            Text I would like to get3.
        </span>
    </body>
</html>

with the solution applied:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)

println htmlParsed.body.div[0].localText()

would return:

[Text I would like to get1., Text I would like to get2., Text I would like to get3.]

However, when parsing the <span> part in this example

println htmlParsed.body.span[0].localText()

the output is

[Text I would like to get2., Text I would like to get3.]

The problem I am facing now is that it's apparently not possible to pinpoint the location ("between which child nodes") of the texts. I would have expected the second invocation to yield

[, Text I would like to get2., Text I would like to get3.]

This would have made it clear: position 0 (before child 0) is empty, position 1 (between child 0 and child 1) is "Text I would like to get2.", and position 2 (after child 1) is "Text I would like to get3." As the API stands, however, there is apparently no way to tell whether the text returned at index 0 actually sits before the first child or somewhere else, and the same holds for every other index.

I have tried it with both XmlSlurper and XmlParser, yielding the same results.

As a consequence, if I'm not mistaken, it is also impossible to fully reconstruct the original HTML document from the parser's output, because this "text index" information is lost.

My question is: Is there any way to find out those text positions? An answer requiring me to change the parser would also be acceptable.


UPDATE / SOLUTION:

For further reference, here's Will P's answer, applied to the original code:

def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
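def parser = new XmlParser(tagsoupParser)
def htmlParsed = parser.parseText(stringToParse)

// children() keeps text and element children in document order;
// keep the (trimmed) strings and mark element nodes with null
println htmlParsed.body.div[0].children().collect { it instanceof String ? it.trim() : null }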

This yields:

[Text I would like to get1., null, Text I would like to get2., null, Text I would like to get3.]

The key is to use XmlParser instead of XmlSlurper and to call node.children(), which returns text and element children together, in document order.
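The "one entry per gap" format I originally expected can be derived from children(). The following is only a sketch under a few assumptions: Groovy 3 or newer, a well-formed fragment parsed with the default SAX parser instead of TagSoup, and a helper named textSlots that is my own invention, not part of Groovy.

```groovy
// Assumes Groovy 3 or newer; on Groovy 2.x, drop the import and
// rely on the default-imported groovy.util.XmlParser instead.
import groovy.xml.XmlParser

// Hypothetical helper (not a Groovy API): returns one string per gap around
// the element children of a groovy.util.Node. Entry i is the text found
// directly before element child i; the last entry is the trailing text.
List<String> textSlots(Node node) {
    def slots = []
    def current = new StringBuilder()
    node.children().each { child ->
        if (child instanceof String) {
            current.append(child.trim())      // text belongs to the current gap
        } else {
            slots << current.toString()       // an element closes the current gap
            current = new StringBuilder()
        }
    }
    slots << current.toString()               // text after the last element
    slots
}

def xml = '''<body>
    <div>
        Text I would like to get1.
        <a href="http://intro.com">extra stuff</a>
        Text I would like to get2.
    </div>
    <span>
        <a href="http://intro.com">extra stuff</a>
        Text I would like to get2.
        <a href="http://example.com">link to example</a>
        Text I would like to get3.
    </span>
</body>'''

def root = new XmlParser().parseText(xml)

assert textSlots(root.div[0]) == ['Text I would like to get1.', 'Text I would like to get2.']
// The empty first entry shows that nothing precedes the first <a> in <span>.
assert textSlots(root.span[0]) == ['', 'Text I would like to get2.', 'Text I would like to get3.']
```

With n element children the helper always returns n + 1 entries, so an index in the result maps directly to a position between children.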


Solution

  • I don't know TagSoup, and I hope it does not interfere with the solution, but with a plain XmlParser you can get the list returned by children(), which contains the raw strings alongside the element nodes:

    html = '''<html>
        <body>
            <div>
                Text I would like to get1.
                <a href="http://intro.com">extra stuff</a>
                Text I would like to get2.
                <a href="http://example.com">link to example</a>
                Text I would like to get3.
            </div>
            <span>
                <a href="http://intro.com">extra stuff</a>
                Text I would like to get2.
                <a href="http://example.com">link to example</a>
                Text I would like to get3.
            </span>
        </body>
    </html>'''
    
    def root = new XmlParser().parseText html
    
    root.body.div[0].children().with {
        assert get(0).trim() == 'Text I would like to get1.'
        assert get(0).getClass() == String
    
        assert get(1).name() == 'a'
        assert get(1).getClass() == Node
    
        assert get(2) == '''
                Text I would like to get2.
                '''
    }