
HtmlUnit Login attempt leads to a weird page I can't get past. "Script is disabled. Click Submit to continue"


TLDR:

I log in with the HtmlUnit headless browser, and the site redirects me to a page where I have to click a submit button to continue. I can't find that button's element in HtmlUnit, so I can't click it to reach the desired page after login. This page does not appear during a regular human login.

Background

My school has a learning environment where we subscribe to courses to download lesson material and such.

As I just started learning Java for a course, I figured I could try to make a Java application that logs in and fetches all the lesson material for me.

I must note that this learning environment requires a login through a Microsoft environment that resembles Outlook but is customized for universities. Perhaps that gives a clue as to what the page I land on is supposed to be.

What I tried

I took a look at HtmlUnit; it seemed like the headless browser could at least accomplish my login goal. I set up a WebClient and navigated to the page.

Like so:

    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getCookieManager().setCookiesEnabled(true);
    webClient.getOptions().setRedirectEnabled(true);
    HtmlPage page = webClient.getPage(LOGIN_FORM_URL);

All went well: I got to the login page, isolated the form, and filled the inputs with my credentials:

    HtmlForm form = page.getForms().get(0);        
    HtmlEmailInput username =  form.getInputByName("UserName");
    HtmlPasswordInput pass =  form.getInputByName("Password"); 
    HtmlElement buttonElement = form.getElementsByTagName("span").get(1);
    username.setValueAttribute(USERNAME);
    pass.setValueAttribute(PASSWORD);      

    HtmlPage page2 = buttonElement.click();

The Problem

I expected to be redirected to the learning environment, but instead I got a weird page. This is its structure when printed with page2.asXml():

    <html>
     <head>
      <title>
       Working...
      </title>
     </head>
     <body>
      <form method="POST" name="hiddenform" action="https://engine.surfconext.nl:443/authentication/sp/consume-assertion">
        <input type="hidden" name="SAMLResponse" value="PHNhbWxwOl.... (an insanely long value)" />
        <noscript>
          <p>Script is disabled. Click Submit to continue.</p><input type="submit" value="Submit" />
        </noscript>
      </form>
      <script language="javascript">
      //<![CDATA[
        window.setTimeout('document.forms[0].submit()', 0);
      //]]>
      </script>
     </body>
    </html>

I cannot for the life of me figure out how to click on the input between the noscript tags.

I tried to find the submit input with getElementsByTagName so I could simulate a click on it, but it doesn't even seem to recognize that it is there. When I used getChildElementCount() on the noscript tag, it returned 0.

Do I need to do something special to get past this page?


Solution

  • I think this question is too broad to answer fully, but as you provide further information and findings I will update the answer.

    Disclaimer: This answer is for educational purposes only. I'm not willing to help you build a web scraper. At least not for free ;)

    The page you landed on is an anti-scraper page, built to prevent automated systems from logging in to the site. This implies two things:

    • Your fake browser has been detected (even if you are connecting from a conventional IP)
    • They are trying to block you.

    This suggests there may be other such techniques along the path to prevent you from proceeding, but it is worth a try.

    First of all, you may have been detected only because of a poor HTTP header setup. Try changing the BrowserVersion, or even reproducing the HTTP headers of your real browser.
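
    A minimal sketch of what reproducing browser headers can look like, using only the JDK's java.net.http API rather than HtmlUnit. The class name, the helper browserLikeGet, the example URL, and all header values are assumptions for illustration; the real values would be copied from your own browser's network tab. In HtmlUnit itself, WebClient.addRequestHeader(name, value) should serve a similar purpose.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class BrowserHeadersSketch {

    // Hypothetical value: replace with the User-Agent your real browser sends.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0";

    /** Builds a GET request that carries a few typical browser headers. */
    static HttpRequest browserLikeGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("User-Agent", USER_AGENT)
            .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            .header("Accept-Language", "en-US,en;q=0.5")
            .GET()
            .build();
    }

    public static void main(String[] args) {
        // example.org is a placeholder, not the real login URL.
        HttpRequest request = browserLikeGet("https://example.org/login");
        // Print the headers that would be sent; no network call is made here.
        request.headers().map().forEach((name, values) ->
            System.out.println(name + ": " + String.join(", ", values)));
    }
}
```

    Nothing is sent over the network here; the sketch only shows how the headers are attached so they can be compared with the ones a real browser sends.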

    If that does not work, there is an easy way through here, as neither the form nor the hidden input is wrapped in the <noscript> tag (here I'm telling you, SURFspot, how to improve). You can parse the form's method and action attributes and the input's name and value, and then you only need to produce a fake POST request as the next step (so you are not clicking the button, but rather faking what would happen if you were able to).

    So, send a form POST with the correct values to the right URL. Check whether they have set any cookies (if so, copy them as well) and set the correct value for the realm header (they may be checking that as well), and the doors shall open.
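
    The form post described above could be sketched roughly as follows, using only the JDK (Java 11+). The class and helper names (HiddenFormPostSketch, encodeForm, buildFormPost) and the cookie value are hypothetical; the action URL and the field name SAMLResponse come from the page in the question, and the value stays truncated as it was posted. Inside HtmlUnit, a WebRequest with HttpMethod.POST and setRequestParameters passed to webClient.getPage would likely be the closer fit, since it reuses the WebClient's cookies.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class HiddenFormPostSketch {

    /** Encodes name/value pairs as an application/x-www-form-urlencoded body. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Builds the POST a browser would send when submitting the hidden form. */
    static HttpRequest buildFormPost(String action, Map<String, String> fields,
                                     String cookieHeader) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(action))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)));
        // Copy over any cookies the login page set, if there are some.
        if (cookieHeader != null && !cookieHeader.isEmpty()) {
            builder.header("Cookie", cookieHeader);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Name and value as read from the hidden input on the intermediate page.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("SAMLResponse", "PHNhbWxwOl..."); // truncated placeholder, not a real value

        HttpRequest post = buildFormPost(
            "https://engine.surfconext.nl:443/authentication/sp/consume-assertion",
            fields,
            "name=value"); // hypothetical cookie copied from the login response

        // Only describe the request; sending it would need
        // HttpClient.send(post, HttpResponse.BodyHandlers.ofString()).
        System.out.println(post.method() + " " + post.uri());
    }
}
```

    URL-encoding matters here because a Base64 value such as SAMLResponse can contain +, / and =, which would otherwise be misread by the server.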