Tags: parsing, groovy, rss, xmlslurper

Parse RSS with Groovy


I am trying to parse RSS feeds with Groovy. I only want to extract the values of the title and description tags of each item. I used the following code snippet to do this:

    rss = new XmlSlurper().parse(url)
    rss.channel.item.each {
        titleList.add(it.title)
        descriptionList.add(it.description)
    }

After this, I access these values in my JSP page. The problem is that the description value I get contains not only the text of <description> (a child of <item>) but also the text of <media:description> (another, optional child of <item>). What can I change to extract only the value of <description> and omit the value of <media:description>?

Edit: To reproduce this behavior, you can run the following code on this website: http://www.tutorialspoint.com/execute_groovy_online.php

    def url = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"
    rss = new XmlSlurper().parse(url)
    rss.channel.item.each {
        println "${it.title}"
        println "${it.description}"
    }

You will see that the content of the <media:description> tag is also printed to the console.


Solution

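  • You can tell XmlSlurper (and XmlParser) not to handle namespaces by passing false for the second constructor argument, namespaceAware (the first is validating). Element names then keep their prefix, so media:description is no longer treated as description. I believe this does what you are after: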

    'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'.toURL().withReader { r ->
        new XmlSlurper(false, false).parse(r).channel.item.each {
            println it.title
            println it.description
        }
    }
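
To see the difference without hitting the network, here is a minimal sketch that runs both slurper configurations against a small inline feed. The XML sample and the variable names (sampleFeed, mixed, only) are made up for illustration and are not taken from the NYT feed. The import assumes Groovy 3 or later; on Groovy 2.x, drop it, since XmlSlurper lives in the auto-imported groovy.util package there.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; remove this line on Groovy 2.x

// Hypothetical feed fragment standing in for the real RSS document.
def sampleFeed = '''<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example headline</title>
      <description>Plain summary</description>
      <media:description>Photo caption</media:description>
    </item>
  </channel>
</rss>'''

// Default (namespace-aware) slurper: the lookup goes by local name,
// so both <description> and <media:description> are selected.
def aware = new XmlSlurper().parseText(sampleFeed)
def mixed = aware.channel.item[0].description.text()

// Namespace-unaware slurper: the prefixed element keeps the name
// "media:description", so only the plain <description> is selected.
def unaware = new XmlSlurper(false, false).parseText(sampleFeed)
def only = unaware.channel.item[0].description.text()

println "namespace-aware:   ${mixed}"
println "namespace-unaware: ${only}"
```

The first line printed should contain both texts run together, while the second should contain only the plain summary. Calling text() also gives you a String rather than a GPathResult, which is usually what you want before adding values to a list for a JSP page.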