How to show all AJAX requests with HtmlUnit


I want to get a list of all network calls made by a web page. This is the page's URL:

https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc

If you look at Developer Console -> Network, you will see the full list of requests the page makes.

This is my code:

    public static void main(String[] args) throws IOException {
        final WebClient webClient = configWebClient();
        final List<String> list = new ArrayList<>();
        new WebConnectionWrapper(webClient) {
            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                final WebResponse response = super.getResponse(request);
                list.add(request.getUrl().toString());
                return response;
            }
        };
        webClient.getPage("https://www.upwork.com/ab/find-work/");
        list.forEach(System.out::println); 
    }

    private static WebClient configWebClient() {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60);

        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.waitForBackgroundJavaScriptStartingBefore(5_000);
        webClient.waitForBackgroundJavaScript(3_000);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setUseInsecureSSL(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.setAjaxController(new AjaxController());
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getCookieManager().setCookiesEnabled(true);
        return webClient;
    }

This is the output:

https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc
https://www.upwork.com/o/jobs/browse/?q=Java
https://www.upwork.com:443/o/jobs/browse/js/328ecc3.js?4af40b2
https://www.googletagmanager.com/gtm.js?id=GTM-5XK7SV
https://client.perimeterx.net/PXSs13U803/main.min.js
https://assets.static-upwork.com/components/11.4.0/core.11.4.0.air2.min.js
https://assets.static-upwork.com/global-components/@latest/ugc.js
https://assets.static-upwork.com/global-components/@latest/ugc/ugc.6jcmqb32.js
https://www.upwork.com:443/static/jsui/JobSearchUI/assets/4af40b2/js/55260a3.js

As you can see, it doesn't contain any XHR calls. What am I doing wrong?


Solution

  • Your question uses two different URLs; I hope I have used the right one.

    • As mentioned many times here, the .waitForBackground... methods are not options. You have to call them AFTER the call that triggers the web requests (for example getPage()).
    • The A in AJAX stands for asynchronous. webClient.getPage() is a synchronous call that returns once the page itself is loaded, so you have to wait afterwards for all the background JavaScript to finish.
    • Loading the page seems to produce some JavaScript errors in HtmlUnit. This may prevent some of the page's JavaScript from running (HtmlUnit's Rhino-based engine still lacks support for some JavaScript features; any help is welcome).
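
To illustrate the second point without HtmlUnit, here is a minimal plain-Java sketch. It is only an analogy: the class AsyncLoadSketch, its methods, and the /page and /api/search paths are made up for illustration, and the latch merely stands in for a script timer that fires after the page has loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-Java analogy, no HtmlUnit involved. All names here are hypothetical.
class AsyncLoadSketch {
    // Thread-safe: written by the caller's thread and by the background thread.
    private final List<String> recorded = new CopyOnWriteArrayList<>();
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    // Stands in for a script timer that fires only after the page has loaded.
    private final CountDownLatch timerFired = new CountDownLatch(1);

    // Stand-in for getPage(): records the document request, schedules the
    // "XHR" in the background, and returns without waiting for it.
    void loadPage() {
        recorded.add("GET /page");
        background.execute(() -> {
            try {
                timerFired.await();
                recorded.add("POST /api/search");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Stand-in for waitForBackgroundJavaScript(timeout): lets the pending
    // work run and blocks until it is done or the timeout expires.
    boolean waitForBackground(long timeoutMillis) {
        timerFired.countDown();
        background.shutdown();
        try {
            return background.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Copy of what has been recorded so far.
    List<String> snapshot() {
        return new ArrayList<>(recorded);
    }

    public static void main(String[] args) {
        AsyncLoadSketch sketch = new AsyncLoadSketch();
        sketch.loadPage();
        System.out.println("right after loadPage(): " + sketch.snapshot());
        sketch.waitForBackground(5_000);
        System.out.println("after waiting:          " + sketch.snapshot());
    }
}
```

The first line printed contains only GET /page; the background request shows up only in the second line, after the wait. The code in the question reads its list at the equivalent of the first line.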

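      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;

      // HtmlUnit 2.x package names (3.x moved to org.htmlunit)
      import com.gargoylesoftware.htmlunit.BrowserVersion;
      import com.gargoylesoftware.htmlunit.WebClient;
      import com.gargoylesoftware.htmlunit.WebRequest;
      import com.gargoylesoftware.htmlunit.WebResponse;
      import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

      public class ListRequests {
          public static void main(String[] args) throws IOException {
              final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60);
              // keep going when the page's scripts fail in HtmlUnit
              webClient.getOptions().setThrowExceptionOnScriptError(false);

              // synchronized, because background JavaScript may issue requests from other threads
              final List<String> list = Collections.synchronizedList(new ArrayList<>());

              // the wrapper registers itself with the WebClient; every request passes through it
              new WebConnectionWrapper(webClient) {
                  @Override
                  public WebResponse getResponse(final WebRequest request) throws IOException {
                      final WebResponse response = super.getResponse(request);
                      list.add(request.getHttpMethod() + " " + request.getUrl());
                      return response;
                  }
              };

              webClient.getPage("https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc");
              // getPage() has returned; now give background JavaScript (and its XHR calls) up to 10 seconds
              webClient.waitForBackgroundJavaScript(10_000);
              list.forEach(System.out::println);
          }
      }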
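
The recorded list will still contain script and other static-asset requests next to any XHR calls. If you only care about the latter, one rough option is to filter the "METHOD URL" entries by file extension afterwards. This is a heuristic sketch using only the JDK, not an HtmlUnit feature; the class RequestLogFilter, its suffix list, and the example.com entry are assumptions made for illustration. The page's own document request is kept as well, since it has no static-asset suffix.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: drops entries that look like static assets.
class RequestLogFilter {
    // Assumed list of suffixes; adjust it to the site you are inspecting.
    private static final List<String> STATIC_SUFFIXES = Arrays.asList(
            ".js", ".css", ".png", ".jpg", ".gif", ".svg", ".ico", ".woff", ".woff2");

    // Expects entries in the form "METHOD URL", as produced by the wrapper above.
    // URI.create throws IllegalArgumentException for a malformed URL.
    static List<String> withoutStaticAssets(List<String> entries) {
        List<String> kept = new ArrayList<>();
        for (String entry : entries) {
            String url = entry.substring(entry.indexOf(' ') + 1);
            String path = URI.create(url).getPath();   // query string is not part of the path
            String lower = path == null ? "" : path.toLowerCase(Locale.ROOT);
            boolean isStatic = STATIC_SUFFIXES.stream().anyMatch(lower::endsWith);
            if (!isStatic) {
                kept.add(entry);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
                "GET https://www.upwork.com/o/jobs/browse/?q=Java&sort=renew_time_int%2Bdesc",
                "GET https://www.googletagmanager.com/gtm.js?id=GTM-5XK7SV",
                "POST https://example.com/api/search");   // made-up XHR-style entry
        withoutStaticAssets(entries).forEach(System.out::println);
    }
}
```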