HtmlUnit to take snapshot of Ajax applications

I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create snapshots so that crawlers can read the page.

I created a servlet that responds to the crawlers, using HtmlUnit.

My application runs perfectly in a browser. But under HtmlUnit, it throws a lot of errors about the special characters I have in the HTML. These characters are content, and I would rather not replace them with character entities just because of HtmlUnit, since the page currently works. (At the very least, I should first check whether I'm using HtmlUnit correctly.)

My page with the error

I think HtmlUnit should read the page's charset information and render it as a browser would, since that is the goal of the project.

I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all the content of my website in order to use this Java library to take snapshots?

Here's my code:

    if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
        // ok, it's the crawler
        // rewrite the URL back to the original #! version
        // remember to unescape any %XX characters

        url = URLDecoder.decode(url, "UTF-8");

        String ajaxURL = url.replace("?_escaped_fragment_=", "#!");

        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);

        HtmlPage page = webClient.getPage(ajaxURL);

        // important!  Give the headless browser enough time to execute JavaScript
        // The exact time to wait may depend on your application.

        // return the snapshot
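
As a side note on the URL rewriting step in the snippet above, here is a minimal, standard-library-only sketch of mapping an `_escaped_fragment_` URL back to its `#!` form. The class and method names (`EscapedFragmentUrls`, `toAjaxUrl`, `isCrawlerRequest`) are hypothetical and not part of the original code. Unlike the snippet, it decodes only the fragment value rather than the whole URL, and it also accepts the `&_escaped_fragment_=` form used when other query parameters are present; both choices are assumptions about what is wanted here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper; the names are illustrative only.
class EscapedFragmentUrls {
    private static final String KEY = "_escaped_fragment_=";

    // True when the query string carries the crawler marker.
    static boolean isCrawlerRequest(String queryString) {
        return queryString != null && queryString.contains("_escaped_fragment_");
    }

    // Rewrites ...?_escaped_fragment_=VALUE (or ...&_escaped_fragment_=VALUE)
    // to ...#!VALUE, unescaping %XX sequences in VALUE only.
    static String toAjaxUrl(String url) {
        int i = url.indexOf("?" + KEY);
        if (i < 0) {
            i = url.indexOf("&" + KEY);
        }
        if (i < 0) {
            return url; // no marker: leave the URL untouched
        }
        String base = url.substring(0, i);
        String rawFragment = url.substring(i + 1 + KEY.length());
        try {
            return base + "#!" + URLDecoder.decode(rawFragment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAjaxUrl("http://example.com/app?_escaped_fragment_=products%2F42"));
    }
}
```

The returned string would then be passed to `webClient.getPage(...)` in place of `ajaxURL`. Note that `URLDecoder` also turns `+` into a space, which may or may not suit a given application's fragment format.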


  • The problem was the XML conflicting with the HTML. @ColinAlworth's comments helped me.

    I followed Google's example, and it did not work.

    To make it work, you need to remove the XML tags and respond with just the HTML, by changing the line:

     // return the snapshot



    Now it's rendering.

    But although it now renders, the CSS is not working and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about the CSS, even though I'm using Twitter Bootstrap without any changes. Apparently the HtmlUnit project has a lot of bugs: it is good for small tests, but not for parsing complex (or even simple) HTML.
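
The replacement line itself is not shown in the answer above. As a rough sketch of what "removing the XML tags" could look like, the helper below strips a leading XML declaration from the serialized snapshot before it is written to the response. It assumes that the string produced for the snapshot (for example by `page.asXml()`) begins with a `<?xml ... ?>` declaration, which is what the answer implies; the class and method names (`SnapshotMarkup`, `stripXmlDeclaration`) are hypothetical, and this is not necessarily the change the author made.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not taken from the original answer.
class SnapshotMarkup {
    // Matches an XML declaration at the very start of the markup,
    // plus any whitespace around it.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Returns the markup without a leading <?xml ... ?> declaration.
    // Markup that has no declaration is returned unchanged.
    static String stripXmlDeclaration(String markup) {
        return XML_DECLARATION.matcher(markup).replaceFirst("");
    }

    public static void main(String[] args) {
        String snapshot = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html><body>ok</body></html>";
        System.out.println(stripXmlDeclaration(snapshot));
    }
}
```

In the servlet, the result of `stripXmlDeclaration(...)` would be what gets written to the response instead of the raw serialized page. Setting the response content type and charset explicitly (for instance `text/html; charset=UTF-8`) is probably also worth checking, given the special-character errors described in the question.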