I created a basic GWT (Google Web Toolkit) Ajax application, and now I'm trying to create HTML snapshots so that crawlers can read the page.
I wrote a servlet that responds to the crawlers, using HtmlUnit.
My application runs perfectly in a browser. But under HtmlUnit, it throws a lot of errors about the special characters I have in the HTML. These characters are content, and I would rather not replace them with character entities just for HtmlUnit's sake, since the page currently works. (At the very least, I should first check whether I'm using HtmlUnit correctly.)
I would expect HtmlUnit to read the page's charset information and render the page as a browser does, since I believe that is the goal of the project.
I haven't found good information about this problem. Is this an HtmlUnit limitation? Do I need to change all of my website's content in order to use this Java library to take snapshots?
Here's my code:
    if ((queryString != null) && (queryString.contains("_escaped_fragment_"))) {
        // OK, it's the crawler.
        // Rewrite the URL back to the original #! version;
        // remember to unescape any %XX characters.
        url = URLDecoder.decode(url, "UTF-8");
        String ajaxURL = url.replace("?_escaped_fragment_=", "#!");
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        HtmlPage page = webClient.getPage(ajaxURL);
        // Important! Give the headless browser enough time to execute JavaScript.
        // The exact time to wait may depend on your application.
        webClient.waitForBackgroundJavaScript(3000);
        // Return the snapshot.
        response.getWriter().write(page.asXml());
    }
The problem was XML conflicting with the HTML. @ColinAlworth's comments helped me.
I followed Google's example, and it was not working.
To make it work, you need to strip the leading XML declaration so that only the HTML is sent in the response, by changing the line:
// return the snapshot
response.getWriter().write(page.asXml());
to
response.getWriter().write(page.asXml().replaceFirst("<\\?.*>",""));
Now it's rendering.
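For illustration, the replacement above can be sketched as a small standalone helper. This is only a sketch: the class name SnapshotCleaner and the method stripXmlDeclaration are hypothetical names introduced here, and the pattern is a slightly stricter variant of the one-liner above, which assumes the snapshot begins with a declaration such as <?xml version="1.0" encoding="UTF-8"?>. It does not use HtmlUnit itself, only java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): removes a leading XML
// declaration from a snapshot string so that only the markup remains.
public final class SnapshotCleaner {

    // Matches an XML declaration only at the very start of the input,
    // plus any whitespace around it. Unlike "<\\?.*>", it cannot
    // consume other tags that happen to share the first line.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("\\A\\s*<\\?xml[^>]*\\?>\\s*");

    private SnapshotCleaner() {
        // Utility class, not meant to be instantiated.
    }

    public static String stripXmlDeclaration(String xml) {
        return XML_DECLARATION.matcher(xml).replaceFirst("");
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch.
        String snapshot = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><head/><body>caf\u00e9</body></html>";
        System.out.println(stripXmlDeclaration(snapshot));
    }
}
```

In the servlet, the call would then presumably look like response.getWriter().write(SnapshotCleaner.stripXmlDeclaration(page.asXml())). Input that has no declaration is returned unchanged.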
But although the page is now being rendered, the CSS is not working and the DOM is not updated (GWT updates the page title when the page opens). HtmlUnit threw a lot of errors about the CSS, even though I'm using Twitter Bootstrap without any changes. Apparently the HtmlUnit project has a lot of bugs: it is good for small tests, but not for parsing complex (or even simple) HTML.