I'm trying to parse an HTML page and produce an object equivalent to what I see when I inspect the HTML using a browser's Web Developer Tools (Firefox or Chrome).
The web page contains some URLs which are download links, and I'd like to collate a list of all the download links on the page. I've tried a number of methods, but each one gives an object containing only part of the information. The best I have managed so far is using TagSoup.
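import groovy.xml.XmlUtil

// TagSoup tolerates malformed HTML and feeds it to XmlSlurper as SAX events
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("https://somewebsite")
// Walk every node depth-first and append its serialized form to the file
htmlParser.'**'.findAll{it}.each {
    new File("myFile.txt") << XmlUtil.serialize( it )
}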
This works to an extent, but annoyingly the HTML stops just before the part of the page where the URLs are, so I don't get the full HTML. If you use the Web Developer Tools in a browser and compare the Inspector view with what's in myFile.txt, they match pretty well up to a point where the HTML in the file just stops, and I miss a whole chunk of HTML (the bits I need!).
I also tried this code, which gave a similar result :-
// NekoHTML alternative: parse the page into a Node tree, then serialize each DIV
def parser = new org.cyberneko.html.parsers.SAXParser()
new XmlParser( parser ).parse( 'https://somewebsite' ).with { page ->
    page.'**'.DIV.grep {it}.each { div ->
        new File("myFile.txt") << XmlUtil.serialize( div )
    }
}
So this didn't work either. I've tried a few other methods too, but none gave me what I want, and they all did worse than the two I've detailed above.
Embedded in the web page is this line, and it is lines like this I need to extract (among others like it) :-
<a target="_blank" href="somedocument" class="p2n-btn p2n-btn-inverse p2n-btn-download p2n-bottom-info-link p2n-btn-fs-mini p2n" title="Download file"> </a>
I'm using Grails 2.5.6 / Groovy but am happy to use native Java if that works.
If you look at the source code of that page, I don't believe it contains any of the links you're looking for.
If you load it in a browser with the network tab open, you can see that once the main page has loaded, it makes a second request (via JavaScript) to
I believe the response to that request contains the HTML you are looking for.
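Since you said native Java is fine, here is a minimal sketch of how you might pull the download links out of that second response. It rests on several assumptions: the class name `DownloadLinkExtractor` and its methods are made up for illustration, you would pass to `fetch` whatever address the network tab shows for the second request, and the links are assumed to carry the `p2n-btn-download` class with double-quoted attributes, as in the sample line in your question. It also uses a regular expression rather than a real HTML parser, which is only reasonable for simple, predictable markup like this; feeding the fetched text to TagSoup as you already do would be more robust.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class DownloadLinkExtractor {

    // Opening <a ...> tag; group 1 is the raw attribute text.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Double-quoted href and class attributes (assumption: no single quotes).
    private static final Pattern HREF =
            Pattern.compile("(?:^|\\s)href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLASS_ATTR =
            Pattern.compile("(?:^|\\s)class\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href of every anchor whose class list contains p2n-btn-download.
    static List<String> extractDownloadLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            Matcher cls = CLASS_ATTR.matcher(attrs);
            if (href.find() && cls.find()) {
                List<String> classes = Arrays.asList(cls.group(1).trim().split("\\s+"));
                if (classes.contains("p2n-btn-download")) {
                    links.add(href.group(1));
                }
            }
        }
        return links;
    }

    // Downloads the body of the given address as text (assumes UTF-8).
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // With an argument: fetch that address (the second request's URL).
        // Without one: run against the sample anchor from the question.
        String html = args.length > 0
                ? fetch(args[0])
                : "<a target=\"_blank\" href=\"somedocument\" "
                  + "class=\"p2n-btn p2n-btn-inverse p2n-btn-download p2n\" "
                  + "title=\"Download file\"> </a>";
        System.out.println(extractDownloadLinks(html)); // sample prints [somedocument]
    }
}
```

If the hrefs turn out to be relative, you would still need to resolve them against the site's base address before downloading anything.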