Java HtmlUnit - receiving empty href when scraping website


I am working on a project that submits a URL to several websites to check its categorisation and any security risks, using Java and HtmlUnit. www.virustotal.com is the last site I need to configure, and I cannot get past one step because an href on the page is empty.

The site works by entering a URL on the first page and clicking submit. A popup then asks the user whether to re-analyse or use the last scan results (in this case we always want to re-analyse). It is the re-analyse anchor that has the empty href. My guess is that this is a JavaScript issue: the script that should generate the URL to the results page is not doing so. Unfortunately I am unsure where to go next :/

Project code (apologies for how scruffy it is!):

    //turn off htmlunit logging//
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.http").setLevel(java.util.logging.Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.http.client.protocol.ResponseProcessCookies").setLevel(java.util.logging.Level.OFF);

    //initialise url and obtain users selection//
    System.out.println("Please select the url you would like to review:");
    Scanner sc = new Scanner(System.in);
    String startPath = sc.nextLine();

    //enable javascript and use engine to initialise and parse websites code//
    String url = "https://www.virustotal.com/#url";
    System.out.println("Connecting to Virus Total...");
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.waitForBackgroundJavaScript(8000);
    HtmlPage page = webClient.getPage(url);

    //fill in form
    HtmlForm form = page.getFirstByXPath("//form[@action='/en/url/submission/']");
    HtmlTextInput textField = form.getInputByName("url");
    textField.setValueAttribute(startPath);
    HtmlButton button1 = page.getFirstByXPath("//button[@id='btn-scan-url']");
    HtmlPage page1 = button1.click();

    //waiting and dealing with popup
    webClient.waitForBackgroundJavaScript(8000);
    String page1String = page1.getWebResponse().getContentAsString();
    System.out.println(page1String);
    HtmlAnchor htmlAnchor = page1.getFirstByXPath("//a[@id='btn-url-reanalyse']");
    System.out.println(htmlAnchor); //testing what I can see in the anchor
    HtmlPage page2 = htmlAnchor.click();

    //progressing to next screen
    String output = page2.asText();
    System.out.println(output);

HTML I receive when I print out string page1String:

<div class="modal-footer">
  <a id="btn-url-reanalyse" class="btn" href="">
    Reanalyse
  </a>

HTML when manually progressing through site:

<a id="btn-url-reanalyse" class="btn" href="/en/url/submission/?force=1&amp;url=http%3A//www.facebook.com/&amp;token=415eda59daae48938b1dcc64f3152ed5ee9ac27d485348d55c87e9da7e714605">
    Reanalyse
  </a>

Any help or advice would be greatly appreciated! I am also happy to work with any module recommendations that are provided, simply using HtmlUnit as it was one of the first I found that actually worked with other sites.

Thanks in advance.


Solution

    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);

    I think disabling the logging is a bad idea while hunting for a problem. If you enable logging you will see that there is a JavaScript error.
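
    As a rough illustration of the logging point, here is a minimal sketch that uses only java.util.logging. The logger names are the ones from the question; the class name LoggingSetup, the helper enableHtmlUnitLogging and the choice of Level.WARNING are assumptions made for this example. It also assumes HtmlUnit's log output ends up in java.util.logging, as the question's own setup suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper class, not part of HtmlUnit.
class LoggingSetup {
    // Logger names copied from the question's code.
    private static final String[] NAMES = {
        "com.gargoylesoftware.htmlunit",
        "org.apache.http"
    };

    // java.util.logging only holds loggers weakly, so keep strong references;
    // otherwise a configured logger may be garbage-collected and lose its level.
    private static final List<Logger> KEEP = new ArrayList<>();

    // Applies the given level to each named logger and returns the HtmlUnit logger.
    static Logger enableHtmlUnitLogging(Level level) {
        Logger first = null;
        for (String name : NAMES) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(level);
            KEEP.add(logger);
            if (first == null) {
                first = logger;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // WARNING instead of OFF, so warnings and errors are no longer suppressed.
        Logger log = enableHtmlUnitLogging(Level.WARNING);
        System.out.println(log.getName() + " -> " + log.getLevel());
    }
}
```

    Calling something like this once before creating the WebClient, in place of the setLevel(Level.OFF) lines, should let the script error show up on the console.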

    webClient.getOptions().setThrowExceptionOnScriptError(false);
    

    Because of this the program continues, but parts of the JavaScript are not executed. I guess that is why your link does not get updated.

    The JavaScript error looks like an HtmlUnit bug. Please open an issue and isolate a minimal test case as described here.