Groovy htmlunit

I'm having issues importing htmlunit (htmlunit.sf.net) into a groovy script.

I'm currently just using the example script that was on the web and it gives me unable to resolve class com.gargoylesoftware.htmlunit.WebClient

The script is:

import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient()
html = client.getPage('http://www.msnbc.msn.com/')
println page.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')

I downloaded the source from the website and placed the com folder (and all its contents) where my script was located.

Does anyone know what issue I'm encountering? I'm not quite sure why it won't import it

Solution

You could use Grape to get the dependecy for you during script runtime. Easiest way to do it is to add a @Grab annotation to your import statement.

Like this:

@Grab('net.sourceforge.htmlunit:htmlunit:2.7')
import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient()

// Added as HtmlUnit had problems with the JavaScript
client.javaScriptEnabled = false
html = client.getPage('http://www.msnbc.msn.com/')
println page.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')

There's only one problem. The page seems to be a little bit to much to chew off for HtmlUnit. When I ran the code I got OutOfMemoryException every time. I'd suggest downloading the html the normal way instead and then using something like NekoHtml or TagSoup to parse the html into XML and work with it that way.

This example uses TagSoup to work with html as xml in Groovy: http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/