java html groovy grails

Unable to fully parse an HTML page with Groovy / Grails


I'm trying to parse an HTML page and produce an object that is the equivalent of what I see when I inspect the HTML using a browser's Web Developer Tools (Firefox or Chrome).

The web page contains some URLs which are download links, and I'd like to collate a list of all the download links on the page. I've tried a number of methods, but each one gives me an object containing only part of the page. The best I have managed so far is using TagSoup.

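    import groovy.xml.XmlUtil

    // TagSoup tolerates malformed HTML and feeds SAX events to XmlSlurper
    def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
    def slurper = new XmlSlurper(tagsoupParser)
    def htmlParser = slurper.parse("https://somewebsite")

    // Append the serialized form of each node from the depth-first traversal to myFile.txt
    htmlParser.'**'.findAll{it}.each {
        new File("myFile.txt") << XmlUtil.serialize( it )
    }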

This works to an extent, but annoyingly the HTML stops just before the part of the page where the URLs are, so I don't get the full HTML. If you compare the Inspector view in a browser's Web Developer Tools with what's in myFile.txt, they match pretty well up to a point where the HTML in the file just stops, and I miss a whole chunk of HTML (the bits I need!).

I also tried this code, which gave a similar result :-

    // NekoHTML variant: serialize every DIV found anywhere in the page
    def parser = new org.cyberneko.html.parsers.SAXParser()
    new XmlParser( parser ).parse( 'https://somewebsite' ).with { page ->
        page.'**'.DIV.grep {it}.each { it ->
            new File("myFile.txt") << XmlUtil.serialize( it )
        }
    }

So this didn't work either. I've tried a few other methods too, but none have given what I want and they all fall short of the two I've detailed above.

Embedded in the web page are lines like the following, and these are what I need to extract :-

    <a target="_blank" href="somedocument" class="p2n-btn p2n-btn-inverse p2n-btn-download p2n-bottom-info-link p2n-btn-fs-mini p2n" title="Download file"> </a>

I'm using Grails 2.5.6 / Groovy but am happy to use native Java if that works.


Solution

  • If you look at the source code of that page, I don't believe it contains any of the links you're looking for.

    If you load it in a browser with the network tab open, you can see that once it has loaded the main page, it makes a second request (via JavaScript) to

    https://www.2n.com/en_US/c/portal/render_portlet?p_l_id=618795&p_p_id=101_INSTANCE_ONpHHwoLjEag&p_p_lifecycle=0&p_t_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_pos=1&p_p_col_count=2&p_p_isolated=1&currentURL=%2Fen_US%2Fweb%2F2n%2Fsupport%2Fdocuments%2Ffirmware&_101_INSTANCE_ONpHHwoLjEag_2n-custom-params=2n-search-document&_101_INSTANCE_ONpHHwoLjEag_2n-search-document=2N%C2%AE%20IP%20Verso%202.0%2C2N%C2%AE%20IP%20Style%2C2N%C2%AE%20IP%20Verso%2C2N%C2%AE%20LTE%20Verso%2C2N%C2%AE%20IP%20Solo%2C2N%C2%AE%20IP%20Force%2C2N%C2%AE%20IP%20Safety%2C2N%C2%AE%20IP%20Base%2C2N%C2%AE%20IP%20Audio%20Kit%2C2N%C2%AE%20Induction%20Loop%2C2N%C2%AE%202Wire%2CNVT%20PoLRE%20LPC%20Switch%2C&portletAjaxable=1

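    I believe this contains the HTML you are looking for. TagSoup and NekoHTML only parse the markup the server returns for the URL you give them and do not run JavaScript, so they never see this second response, whereas the browser's inspector shows the DOM after the script has run.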
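
As a rough illustration of that approach, here is a minimal sketch in plain Java (standard library only) that downloads a URL and pulls the `href` out of every anchor carrying the `p2n-btn-download` class from the sample line in the question. The class name `DownloadLinks` and the methods `fetch`, `extract` and `hasClass` are hypothetical names made up for this example. The UTF-8 decoding and the assumption that attributes are double-quoted are guesses about the response, and the regular expressions are a shortcut rather than a real HTML parser, so treat this as a starting point, not a definitive implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from the original post.
class DownloadLinks {

    // Opening <a ...> tag; group 1 is the raw attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Double-quoted href and class attributes (assumption: the page quotes them this way).
    private static final Pattern HREF =
            Pattern.compile("(?:^|\\s)href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLASS =
            Pattern.compile("(?:^|\\s)class\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Download the body of the given URL as text (assumption: the response is UTF-8).
    static String fetch(String url) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Collect the href of every <a> whose class list contains "p2n-btn-download".
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher cls = CLASS.matcher(attrs);
            Matcher href = HREF.matcher(attrs);
            if (cls.find() && hasClass(cls.group(1), "p2n-btn-download") && href.find()) {
                links.add(href.group(1));
            }
        }
        return links;
    }

    // True if the space-separated class attribute contains the wanted class name.
    static boolean hasClass(String classAttr, String wanted) {
        for (String c : classAttr.trim().split("\\s+")) {
            if (c.equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

Calling `DownloadLinks.extract(DownloadLinks.fetch(portletUrl))`, where `portletUrl` is the `render_portlet` address quoted above, should then give the list of links, provided the server answers that request without the cookies or headers a browser would send. Any relative `href` values would still need resolving against the site's base URL. The same fetched text could equally be fed to TagSoup and `XmlSlurper` instead of the regular expressions.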