Tags: java, gwt, htmlunit, googlebot, google-crawlers

HtmlUnit to take snapshot of Ajax applications


I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create snapshots so that crawlers can read the page.

I created a servlet to respond to the crawlers, using HtmlUnit.

My application runs perfectly in a browser. But in HtmlUnit, it throws a lot of errors about the special characters I have in the HTML. These characters are content, and I would rather not replace them with character entities just because of HtmlUnit, since the page currently works. (At the very least, I should first check whether I'm using HtmlUnit correctly.)

My page with the error

I think HtmlUnit should read the page's charset information and render the page as a browser would, since that is the goal of the project.

I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all the content of my website in order to use this Java library to take snapshots?

Here's my code:

    if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
        // OK, it's the crawler.
        // Rewrite the URL back to the original #! version.
        // Remember to unescape any %XX characters.
        url = URLDecoder.decode(url, "UTF-8");
        String ajaxURL = url.replace("?_escaped_fragment_=", "#!");

        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        HtmlPage page = webClient.getPage(ajaxURL);

        // Important! Give the headless browser enough time to execute JavaScript.
        // The exact time to wait may depend on your application.
        webClient.waitForBackgroundJavaScript(3000);

        // Return the snapshot.
        response.getWriter().write(page.asXml());
    }
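
The URL rewrite in the snippet above can be pulled out into a small helper, which makes it easier to test without starting HtmlUnit. The following is only a sketch: the class name `EscapedFragmentUrls` and the method `toAjaxUrl` are hypothetical, and it assumes the crawler may send `_escaped_fragment_` either as the first query parameter (after `?`) or as a later one (after `&`), a case the plain `replace` call above does not cover.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of GWT or HtmlUnit.
final class EscapedFragmentUrls {

    private static final String KEY = "_escaped_fragment_=";

    // Turns ".../app?_escaped_fragment_=state" back into ".../app#!state".
    // URLs without the parameter are returned unchanged.
    static String toAjaxUrl(String crawlerUrl) {
        int keyAt = crawlerUrl.indexOf(KEY);
        if (keyAt <= 0) {
            return crawlerUrl;
        }
        char separator = crawlerUrl.charAt(keyAt - 1);
        if (separator != '?' && separator != '&') {
            return crawlerUrl;
        }
        // Everything before the separator is kept as-is.
        String base = crawlerUrl.substring(0, keyAt - 1);
        // Only the fragment value is unescaped (%XX sequences).
        String fragment = crawlerUrl.substring(keyAt + KEY.length());
        try {
            return base + "#!" + URLDecoder.decode(fragment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, the servlet would call `EscapedFragmentUrls.toAjaxUrl(url)` instead of decoding and replacing inline. Decoding only the fragment value, rather than the whole URL, is a design choice of this sketch and leaves any other query parameters untouched.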

Solution

  • The problem was the XML conflicting with the HTML. @ColinAlworth's comments helped me.

    I followed Google's example, and it did not work.

    To make it work, you need to remove the XML declaration so that only the HTML is sent in the response, changing the line:

     // return the snapshot
     response.getWriter().write(page.asXml());
    

    to

     response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));
    

    Now it renders.

    But although the page is rendered, the CSS is not working, and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about CSS, even though I'm using Twitter Bootstrap without any changes. Apparently the HtmlUnit project has a lot of bugs; it seems fine for small tests, but not for parsing complex (or even simple) HTML pages.
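
    The declaration-stripping step above can also be written with a narrower pattern. The expression `<\\?.*>` is greedy, so it removes everything from `<?` up to the last `>` on the same line; it appears to work here only because the declaration sits on a line of its own in the `asXml()` output. The following is a minimal sketch, assuming the goal is to drop only a leading `<?xml ... ?>` declaration. The class name `XmlDeclarationStripper` and the method `strip` are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit.
final class XmlDeclarationStripper {

    // Matches a single XML declaration at the very start of the input,
    // plus any whitespace around it. [^>]* cannot run past the first '>'.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Returns the markup without its leading XML declaration.
    // Input that has no declaration is returned unchanged.
    static String strip(String markup) {
        return XML_DECLARATION.matcher(markup).replaceFirst("");
    }
}
```

    In the servlet this would be used as `response.getWriter().write(XmlDeclarationStripper.strip(page.asXml()));`.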