I'm trying to get a clean representation of a web page's HTML so that I can load it into an
org.w3c.dom.Document
and do further processing with XPath and so on.
When I try to load the HTML into a Document, I get:
org.xml.sax.SAXParseException: The element type "link" must be terminated by the matching end-tag "</link>".
This means that "link" has to be closed, which isn't the case on this website.
So, what would be the right approach? Should I 'fix' the document and replace the errors?
I tried net.sourceforge.htmlcleaner, but I couldn't figure out how to 'fix' the errors with it.
Any help ?
Regards, Holger
HTML is usually not well-formed XML, so the XML parser that builds a Document cannot process it. You need a dedicated HTML library such as JSoup.
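
To illustrate why the XML parser rejects the page, here is a minimal sketch that uses only the JDK. The class name HtmlVsXmlDemo, the helper methods, and the two sample strings are made up for this example and are not taken from the original site. It shows that an unclosed "link" element fails XML parsing, while the same markup with a self-closed "link" parses and can be queried with XPath.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class HtmlVsXmlDemo {

    // Typical HTML: <link> is a void element and is never closed (sample data, assumed).
    static final String HTML_STYLE =
        "<html><head><link rel=\"stylesheet\" href=\"style.css\"></head>"
        + "<body><p>Hi</p></body></html>";

    // Same markup, but well-formed XML: <link .../> is self-closed (sample data, assumed).
    static final String XML_STYLE =
        "<html><head><link rel=\"stylesheet\" href=\"style.css\"/></head>"
        + "<body><p>Hi</p></body></html>";

    // Hypothetical helper: returns the parsed Document, or null if the
    // markup is not well-formed XML (the parser may log the error to stderr).
    static Document parseAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            return null; // e.g. the "link must be terminated" error from the question
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: evaluates an XPath for the first link's href,
    // or returns null if the markup could not be parsed as XML at all.
    static String firstLinkHref(String markup) {
        Document doc = parseAsXml(markup);
        if (doc == null) {
            return null;
        }
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate("/html/head/link/@href", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLinkHref(HTML_STYLE)); // prints "null"
        System.out.println(firstLinkHref(XML_STYLE));  // prints "style.css"
    }
}
```

Real pages rarely look like the second string, which is why an HTML-aware parser is the usual route. With JSoup on the classpath, Jsoup.parse(html) builds a tolerant tree, and org.jsoup.helper.W3CDom can convert that tree to an org.w3c.dom.Document; depending on the JSoup version, XPath expressions may then need to account for the XHTML namespace.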