Using XmlSlurper: How to select sub-elements while iterating over a GPathResult


I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.

Here's the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

I would expect the each to let me select each 'li' in turn so I can retrieve the corresponding href and address details. Instead, I am getting this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.

Can you let me know:

  • Why I'm getting the output shown
  • How I can retrieve the href/address pairs for each 'li' item

Thanks.


Solution

  • Replace grep with find:

    html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
    

    then you'll get

    #href1: Here is the addressTelephone number: telephone
    
    #href2: Here is another addressAnother telephone: 0845 1111111
    

    The difference is in the return type: grep returns an ArrayList of every match, whereas find returns the first match itself, a single NodeChild:

    println html.'**'.grep { it.@class == 'divclass' }.getClass()
    println html.'**'.find { it.@class == 'divclass' }.getClass()
    

    results in:

    class java.util.ArrayList
    class groovy.util.slurpersupport.NodeChild
    

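    Because grep returns a list, Groovy applies .ol.li to each element of that list and collects the results into another list. The each therefore runs once, and linkItem is a single result holding both 'li' nodes, which is why the hrefs and addresses come out concatenated. If you want to keep grep, nest a second each to iterate over the 'li' nodes inside that result: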

    html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
        it.each { linkItem ->
            def link = linkItem.h3.a.@href
            def address = linkItem.address.text()
            println "$link: $address\n"
        }
    }
    

    Long story short, in your case, use find rather than grep.
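
To see the difference directly, here is a minimal sketch that checks what each variant hands to each. It assumes Groovy 3 or later (where XmlSlurper lives in groovy.xml) and uses a shortened, already well-formed sample so that TagSoup is not needed; the sample markup and the variable names viaGrep, viaFind and pairs are made up for illustration.

```groovy
// Assumption: Groovy 3+. On Groovy 2.x, drop this import;
// groovy.util.XmlSlurper is available there by default.
import groovy.xml.XmlSlurper

// Hypothetical, well-formed sample, so no TagSoup parser is required here.
def xml = '''
<html><body>
<div class="divclass">
<ol>
<li><h3><a href="#href1">one</a></h3><address>first address</address></li>
<li><h3><a href="#href2">two</a></h3><address>second address</address></li>
</ol>
</div>
</body></html>
'''

def html = new XmlSlurper().parseText(xml)

// grep gathers all matches into a List; .ol.li is then applied to each
// list element, and the results are collected into a new List.
def viaGrep = html.'**'.grep { it.@class == 'divclass' }.ol.li
assert viaGrep instanceof List
assert viaGrep.size() == 1       // one entry ...
assert viaGrep[0].size() == 2    // ... that wraps both <li> nodes
assert viaGrep[0].h3.a.@href.text() == '#href1#href2'

// find returns the first matching node, so .ol.li is a plain GPathResult
// that iterates over the individual <li> nodes.
def viaFind = html.'**'.find { it.@class == 'divclass' }.ol.li
assert viaFind.size() == 2

// Build one href/address pair per <li>.
def pairs = viaFind.collect { li ->
    [href: li.h3.a.@href.text(), address: li.address.text()]
}
assert pairs == [
    [href: '#href1', address: 'first address'],
    [href: '#href2', address: 'second address']
]
```

Collecting the pairs into a list of maps, rather than printing inside each, is only one option; it keeps the extracted data available for later steps.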