
HtmlUnit with GWT returns incomplete page


I'm trying to use HtmlUnit to test that my GWT website loads properly.

Unfortunately, the page I'm fetching doesn't seem complete. It is missing content which is viewable when I visit the page in my normal browser.

Here's my unit test that is producing this output:

WebClient webClient = new WebClient();
webClient.setThrowExceptionOnScriptError(false);

webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScript(30000);
HtmlPage page = webClient.getPage("http://www.ozdroid.com/#!BLOG/2010/10/12/How_to_Make_Google_AppEngine_Applications_Ajax_Crawlable");

System.out.println(page.asXml());
webClient.closeAllWindows();

Does anyone have any idea what I can do to get around this and fetch the full HTML of the site?

Edit

Here's what the page.asXml() returns with the updated code, which is clearly incomplete:

<?xml version="1.0" encoding="ISO-8859-1"?>
<html xmlns:fb="http://www.facebook.com/2008/fbml>
&lt;head>
&lt;meta http-equiv=" content-type="">
  <head>
    <meta name="google-site-verification" content="_KCG8ec0LvgmXjnBAikAog0knc7jAbIGCu8Cmu2hsCI"/>
    <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7"/>
    <link rel="shortcut icon" href="favicon.ico"/>
    <link rel="icon" type="image/gif" href="favicon.gif"/>
    <title>
      OzDroid - Enterprise Solutions for Android | Laser Barcode
scanners | RFID | Handheld Computers | Rugged PDA's and Mobile Phones
    </title>
    <script type="text/javascript">
//<![CDATA[
var _gaq = _gaq || [];
//]]>
    </script>
    <script type="text/javascript" language="javascript" src="ozdroid/ozdroid.nocache.js">
    </script>
    <script defer="defer">
//<![CDATA[
ozdroid.onInjectionDone('ozdroid')
//]]>
    </script>
    <script src="http://www.google-analytics.com/ga.js" type="text/javascript">
    </script>
  </head>
  <body>
    <!-- OPTIONAL: include this if you want history support -->    <iframe src="javascript:''" id="__gwt_historyFrame" style="position: absolute; width: 0; height: 0; border: 0">
    </iframe>
    <noscript>

&lt;div
    style="width: 22em; position: absolute; left: 50%; margin-left: -11em; color: red; background-color: white; border: 1px solid red; padding: 4px; font-family: sans-serif"&gt;
&lt;p&gt;Welcome, to the website of OzDroid, we sell and distribute rugged Android
 handheld computers, pda's and mobile phones. These devices can be equipped 
 with options including 1D and 2D laser barcode scanners, RFID, wifi,
  bluetooth and cameras.&lt;/p&gt;
 &lt;p&gt; In the near future, we also
 will be supplying logistics software for the same.
&lt;/p&gt;
&lt;p&gt;As this site contains dynamic content that relies on javascript,
 &lt;b&gt;your web browser must have JavaScript enabled&lt;/b&gt; in order for this site to
display correctly.
&lt;/p&gt;&lt;/div&gt;

    </noscript>
    <div id="fb-root">
    </div>
    <!-- Production -->    <script src="http://connect.facebook.net/en_GB/all.js">
    </script>
  </body>
</html>

Thanks


Solution

  • Cuga, the website you are trying to fetch is mine. It was a bit of overkill from when I was learning some GWT stuff and wanted to make the site crawlable. The idea was to make a simple blog so that I could have dynamic content crawled. The blog articles are fetched from the App Engine datastore using RPC calls, so it was a useful test.

    The site serves the full HTML by complying with Google's Ajax crawling standard: replace #! in the URL with ?_escaped_fragment_= .

    The address below should fetch the page from App Engine

    Link

    All the work of generating the HTML snapshot on the App Engine server is done by HtmlUnit, so it's not likely to be an HtmlUnit bug.

    Unfortunately some of the Facebook stuff is now broken, I suspect due to API changes, but to be honest I haven't looked, as I have other priorities.

    As I haven't touched this for over two years, I am a bit rusty...

    Try this: put the line

    webClient.waitForBackgroundJavaScript(30000);
    

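    after getting the page. I think waitForBackgroundJavaScript() is supposed to block the thread you are on until the background JavaScript has finished running or the timeout expires. Calling it before you fetch the page probably does nothing, because no page has been loaded yet and so no scripts are running.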
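
The #! to ?_escaped_fragment_= replacement mentioned above can be sketched in plain Java. This is only an illustration, not the code running on the site: the class name EscapedFragment and the method toCrawlerUrl are made up for the example, and it assumes the fragment contains no characters that would need percent-escaping.

```java
class EscapedFragment {
    // Hypothetical helper: rewrites a "#!" (hash-bang) URL into the
    // "_escaped_fragment_" form that a crawler would request.
    // Assumption: the fragment has no characters that need percent-escaping.
    static String toCrawlerUrl(String url) {
        int i = url.indexOf("#!");
        if (i < 0) {
            return url; // no hash-bang fragment, nothing to rewrite
        }
        String base = url.substring(0, i);
        String fragment = url.substring(i + 2);
        // Use '&' if the URL already carries a query string, otherwise '?'.
        char separator = base.indexOf('?') >= 0 ? '&' : '?';
        return base + separator + "_escaped_fragment_=" + fragment;
    }

    public static void main(String[] args) {
        // The URL from the question, rewritten for a crawler-style request.
        System.out.println(toCrawlerUrl(
            "http://www.ozdroid.com/#!BLOG/2010/10/12/How_to_Make_Google_AppEngine_Applications_Ajax_Crawlable"));
    }
}
```

Fetching the rewritten URL asks the server for the pre-rendered snapshot, so the client does not have to execute the GWT JavaScript at all.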