Downloading multiple files per page with HtmlUnit


I'm navigating through a site with HtmlUnit. It has a table with a list of documents for download. I want to click all the links and gather all the documents (don't worry, the information is public and scraping is not forbidden).

The site is written with JSF, so the links to the documents are actually <a href="#"> elements whose onclick handler sets a hidden field to the appropriate value and then submits the form.

My code is (in Scala, but that doesn't matter):

val link = row.getFirstByXPath[HtmlElement](descriptor.documentLinkPath.get)
if (link.getAttribute("href").endsWith("#")) link.setAttribute("href", "javascript:void(0)")
val documentPage: Page = link.click()
val bytes = IOUtils.toByteArray(documentPage.getWebResponse().getContentAsStream())

There's a problem, however. The first document downloads properly, but from the second one onwards I get the HTML page back rather than the PDF document. (Commenting out the # -> javascript:void(0) replacement has no effect; I put it there because the click used to blow up with an exception.)

JavaScript is enabled, and since the first document downloads, things are generally working. It just doesn't work for the subsequent documents. Any ideas how to resolve this?


Solution

  • I couldn't do it without a page reload either. I think the trick is to execute the JavaScript from the link's onclick attribute directly.

    This one:

    return oamSubmitForm('broi_form','broi_form:dataTable1:4:_idJsp110',null,[['id_','3545']]);
    

    Maybe that helps you.

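    // Imports assume HtmlUnit 2.x (com.gargoylesoftware.htmlunit packages).
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.List;

    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.ScriptResult;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        final WebClient webClient = new WebClient();

        HtmlPage page = webClient.getPage("http://dv.parliament.bg/DVWeb/broeveList.faces");

        // Select each anchor in the table that wraps an image.
        for (HtmlAnchor link : (List<HtmlAnchor>) page.getByXPath("//table[@id='broi_form:dataTable1']//a/img/.."))
        {
            // Drop the "return" so the handler can be executed as a plain statement.
            String commandString = link.getOnClickAttribute().replaceAll("return ", "");
            System.out.println(commandString);

            ScriptResult executeJavaScript = page.executeJavaScript(commandString);

            // The form submit answers with the document; save(...) is a helper not shown here.
            Page newPage = executeJavaScript.getNewPage();
            save(newPage.getWebResponse().getContentAsStream());

            // Reload the list page before handling the next link.
            page = webClient.getPage("http://dv.parliament.bg/DVWeb/broeveList.faces");
        }

    }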
    

    But that's not the correct way of doing it...
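
The save(...) call in the loop is not defined anywhere in the answer. Below is a minimal sketch of what it and the onclick cleanup could look like, using only the JDK. The class name DownloadHelpers, the method toCommand and the extra Path parameter on save are assumptions made for illustration; they are not part of the original code or of HtmlUnit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helpers; names and signatures are illustrative only.
final class DownloadHelpers {

    // Strip a single leading "return" so the handler can run as a plain statement.
    static String toCommand(String onclick) {
        return onclick.trim().replaceFirst("^return\\s+", "");
    }

    // Copy the response stream to the target file, replacing any existing file.
    // Returns the number of bytes written and closes the stream afterwards.
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream body = in) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        String onclick = "return oamSubmitForm('broi_form','broi_form:dataTable1:4:_idJsp110',null,[['id_','3545']]);";
        System.out.println(toCommand(onclick));

        // Stand-in for newPage.getWebResponse().getContentAsStream().
        InputStream fakeResponse = new ByteArrayInputStream("demo".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("document-", ".pdf");
        System.out.println(save(fakeResponse, target) + " bytes written to " + target);
    }
}
```

Anchoring the pattern at the start of the string only removes the leading return, whereas replaceAll("return ", "") would also strip the word if it appeared later in the handler. In the loop you would pass a distinct target path per link so the documents don't overwrite each other.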