
Groovy XML parsing (HTML slurping), can't get my specific case to work


Okay, here's what I'm looking for.

I want to go into the DOM and find every <a> element whose id starts with "thread_title_". Here are a few things I've tried:

// setup
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def gurl = new URL("url")
gurl.withReader { gReader ->
  def gHTML = slurper.parse(gReader)

  def try1 = gHTML.body.find { it['@id'].startsWith("thread_title_") }
  // fails: Caught: groovy.lang.MissingMethodException: No signature of method: groovy.util.slurpersupport.Attributes.startsWith() is applicable for argument types: (java.lang.String) values: [thread_title_]

  def try2 = gHTML.body.find { it['@id'] =~ /thread_title_/ }
  // fails: Caught: groovy.lang.MissingMethodException: No signature of method: groovy.util.slurpersupport.Attributes.startsWith() is applicable for argument types: (java.lang.String) values: [thread_title_]

  def try3 = gHTML.body.find { it['@id'].name.startsWith("thread_title_") }
  // fails: Caught: groovy.lang.MissingMethodException: No signature of method: groovy.util.slurpersupport.NodeChildren.startsWith() is applicable for argument types: (java.lang.String) values: [thread_title_]

  def try4 = gHTML.body.find { it['@id'] == 'thread_title_745429' }
  // doesn't fail, but doesn't return anything either

  def try5 = gHTML.body.findAll { it.name() == 'a' && it.@id.startsWith('thread_title_') }
  try5.eachWithIndex { row, i ->
    println "rn: $i"
  }
  // no output

}

Here is the GroovyDoc for Attributes. I don't really want the attribute's name, I want its value. The GPath page implies that node.character.find { it['@id'] == '2' } works, which looks much like find with startsWith to me. This Stack Overflow answer is similar, but startsWith behaves differently and seems to throw a wrench into the whole thing. The fifth attempt was inspired by this Stack Overflow answer.

And if you are concerned it's a problem with the input data:

$ curl --silent http://www.advrider.com/forums/forumdisplay.php?f=18 | grep thread_title | wc -l
43

Here is some sample output, using the curl | grep above.

<a href="foo" id="thread_title_705760">text</a>
<a href="foo" id="thread_title_753701">text</a>

I have Groovy 1.7.10 installed. I could go newer, don't know if it would help.


Solution

  • How about this?

    @Grab( 'org.ccil.cowan.tagsoup:tagsoup:1.2.1' )
    import org.ccil.cowan.tagsoup.Parser
    
    def gHTML = new URL( 'http://www.advrider.com/forums/forumdisplay.php?f=18' ).withReader { r ->
      new XmlSlurper( new Parser() ).parse( r )
    }
    
    def allLinks = gHTML.body.'**'.findAll { it.name() == 'a' && it.@id.text().startsWith( 'thread_title_' ) }
    allLinks.each { link ->
      println "${link.text()} -> ${link.@href}"
    }
    

    Let me know if you have any problems or questions :-)
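
If it helps to see why the attempts in the question came back empty, here is a small offline sketch. The markup in `sample` is a hand-written stand-in I made up for illustration (it is not the real forum page), and because it is well-formed I am assuming plain XmlSlurper is enough, so TagSoup is left out. It shows two things: find/findAll called directly on `body` only tests the `<body>` node itself, and `@id` comes back as an Attributes wrapper that has to be turned into a String before startsWith can be called on it.

```groovy
// Offline sketch: `sample` is a made-up, well-formed stand-in for the forum page,
// so plain XmlSlurper can parse it without TagSoup or network access.
// (XmlSlurper resolves without an import on older Groovy; on Groovy 4+,
// add `import groovy.xml.XmlSlurper`.)
def sample = '''<html><body>
  <div>
    <a href="foo" id="thread_title_705760">first</a>
    <a href="bar" id="other_1">ignored</a>
    <a href="baz">no id</a>
    <a href="qux" id="thread_title_753701">second</a>
  </div>
</body></html>'''

def page = new XmlSlurper().parseText(sample)

// findAll on page.body tests only the <body> node itself, never its
// descendants, so a check for name() == 'a' can never match here.
def shallow = page.body.findAll { it.name() == 'a' }
assert shallow.size() == 0

// '**' walks body and every descendant depth-first.
// text() turns the Attributes wrapper into a String, which has startsWith;
// an <a> without an id yields an empty string and is simply skipped.
def threadLinks = page.body.'**'.findAll { node ->
    node.name() == 'a' && node.@id.text().startsWith('thread_title_')
}

def ids = threadLinks.collect { it.@id.text() }
println ids
```

The short-circuit in `name() == 'a' && ...` is also why the fifth attempt in the question printed nothing instead of throwing: the only node it ever tested was `<body>`, so the startsWith call was never reached.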