
HtmlUnit getByXpath returns null


I am coding in Groovy, but I don't believe these questions are language-specific.

I actually have two questions.

First Question

I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.

The page I'm testing it on is: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

My code:

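import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4"

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

page = client.getPage(url)

// expected the title link here, but nothing is matched
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")

println title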

This simply prints out: []

Is this because the page uses onclick()? If so, how would I get around that? Enabling JavaScript floods my command prompt with output.

Second Question

I also want to get the image, but I'm having trouble because the XPath I get for it (via Firebug) shows up as: //*[@id="gmi-ResViewSizer_img"]

How do I handle that?


Solution

  • First Answer:

    /html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
    

    Your XPath was off by one in the positional predicate on the body's child div: it selects the 4th div, but the link is under the 3rd. It appears the site's HTML can change, and did change after you originally captured the XPath with Firebug. You may need to adjust your XPath so that it is less sensitive to differences in document structure.

    Maybe something like this:

    /html/body//div/h1/a
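
The off-by-one problem and the looser alternative can be tried on a small sample. The sketch below is only an illustration: it uses the JDK's built-in javax.xml.xpath API on a hypothetical, well-formed stand-in for the page (the class name, PAGE string, and shortened paths are all assumptions, not the real DeviantArt markup). With HtmlUnit, the same expression strings would be passed to page.getByXPath as in the question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathBrittleness {
    // Hypothetical stand-in for the page: the h1/a link sits under the 3rd body div.
    static final String PAGE =
        "<html><body>"
        + "<div>header</div>"
        + "<div>navigation</div>"
        + "<div><div><h1><a href='#'>Brush pack</a></h1></div></div>"
        + "</body></html>";

    // Returns how many nodes the given XPath expression selects in PAGE.
    static int countMatches(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Wrong index: there is no 4th div under body, so nothing matches.
        System.out.println(countMatches("/html/body/div[4]/div/h1/a")); // 0
        // Correct index, but still tied to the exact position of the div.
        System.out.println(countMatches("/html/body/div[3]/div/h1/a")); // 1
        // Descendant search: matches regardless of which body div holds the link.
        System.out.println(countMatches("/html/body//div/h1/a"));       // 1
    }
}
```

An empty result from the first expression corresponds to the [] printed in the question: the expression is valid, it just selects no nodes.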
    

    Second Answer: The XPath that you listed will work. It may look odd or short (and may not be the most efficient), but // starts at the root node and searches every node in the tree, * matches any element (including the img), and the [] predicate filter restricts the match to elements that have an id attribute whose value equals "gmi-ResViewSizer_img".

    There are many other XPath expressions that could work as well; which one is best depends on how often the HTML structure changes. This one also selects that img on the referenced page:

    /html/body/div/div/div/div/img[1]
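
To illustrate the second answer, here is a minimal sketch comparing the id-based expression with the positional one. As before, it assumes a hypothetical, well-formed sample document and uses the JDK's javax.xml.xpath API rather than HtmlUnit; the class name, PAGE string, and src values are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathById {
    // Hypothetical stand-in for the page: the target img is nested four divs deep.
    static final String PAGE =
        "<html><body><div><div><div><div>"
        + "<img id='gmi-ResViewSizer_img' src='full.png'/>"
        + "<img src='thumb.png'/>"
        + "</div></div></div></div></body></html>";

    // Returns the src attribute of the first element the expression selects, or null if none.
    static String srcOf(String expression) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODE);
            return node == null ? null : ((Element) node).getAttribute("src");
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Id-based: finds the element wherever it sits in the tree.
        System.out.println(srcOf("//*[@id=\"gmi-ResViewSizer_img\"]"));   // full.png
        // Positional: works only while the nesting stays exactly like this.
        System.out.println(srcOf("/html/body/div/div/div/div/img[1]"));   // full.png
    }
}
```

Both expressions select the same element here, but only the id-based one would keep working if a wrapper div were added or removed.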