Tags: java, xml, dom, sax, htmlcleaner

How to get a clean XML representation from a website URL


I'm trying to get a clean XML representation of a website URL, so that I can put the HTML inside a

org.w3c.dom.Document

to be able to do further processing with XPath and so on.

What I get when I try to put the HTML inside a document is:

org.xml.sax.SAXParseException: The element type "link" must be terminated by the matching end-tag "</link>"

which means that "link" has to be closed, which isn't the case on this website.
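
An exception like this typically comes from feeding the raw HTML straight into a standard XML DocumentBuilder, roughly like the following sketch (the URL is a placeholder, not the actual code from the question):

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class NaiveHtmlParse {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The HTML is parsed as if it were XML, so the first unclosed element
            // such as <link> makes the parser throw a SAXParseException.
            Document doc = builder.parse("https://example.com/");
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }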

So, what could be the right approach? Should I 'fix' the document and replace the errors myself?

I tried net.sourceforge.htmlcleaner, but I didn't figure out how to 'fix' the errors with it.

Any help?

Regards, Holger


Solution

  • HTML is usually not well-formed XML, so a plain org.w3c.dom.Document parser cannot process it directly. You need a dedicated library such as jsoup to clean it up first.
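
    For example, with jsoup (a sketch, assuming org.jsoup is on the classpath; the URL and the XPath expression are placeholders), you can let it repair the markup and then convert the result into a org.w3c.dom.Document via its W3CDom helper:

        import org.jsoup.Jsoup;
        import org.jsoup.helper.W3CDom;
        import org.w3c.dom.Document;
        import org.w3c.dom.NodeList;

        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathConstants;
        import javax.xml.xpath.XPathFactory;

        public class HtmlToDom {
            public static void main(String[] args) throws Exception {
                // jsoup tolerates real-world HTML and closes elements like <link> for you
                org.jsoup.nodes.Document jsoupDoc = Jsoup.connect("https://example.com/").get();

                // Convert the repaired tree into a W3C DOM Document
                Document w3cDoc = new W3CDom().fromJsoup(jsoupDoc);

                // XPath now works on the result; local-name() is used because the
                // converted document may place elements in the XHTML namespace
                XPath xpath = XPathFactory.newInstance().newXPath();
                NodeList links = (NodeList) xpath.evaluate(
                        "//*[local-name()='link']", w3cDoc, XPathConstants.NODESET);
                System.out.println("link elements: " + links.getLength());
            }
        }

    If you would rather stay with htmlcleaner, its DomSerializer can likewise turn the cleaned TagNode tree into a W3C Document.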