(This is implemented and works in most cases as expected.)
There is one case that fails. I could of course strip the stray
"=""
out of the HTML string and replace it with
"
myself, but why do that when there is a library that can parse invalid HTML!? Unfortunately, the HTML document I want JSoup to parse contains something like this snippet:
<div someattribute="somevalue"=""></div>
Calling JSoup with this configuration ...
Document doc = Jsoup.parse(html);
doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml).charset(StandardCharsets.UTF_8);
html = doc.html();
... returns an HTML document that contains this snippet:
<div someattribute="somevalue" =""=""></div>
XPath then aborts parsing this document with this message:
Auf Elementtyp "div" müssen entweder Attributspezifikationen, ">" oder "/>" folgen.
In English this reads roughly as:
Element type "div" must be followed by either attribute specifications, ">" or "/>".
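The failure can be reproduced without jsoup at all: re-parsing the serialized snippet with the JDK's own XML parser rejects it, because `=""` is not a legal attribute name in XML. A minimal sketch (the class name XmlSnippetCheck and the helper isWellFormedXml are illustrative, not part of any library):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class XmlSnippetCheck {
    // Returns true if the given string parses as well-formed XML.
    static boolean isWellFormedXml(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // `=""` is not a legal XML attribute name, so the serialized snippet is rejected.
        System.out.println(isWellFormedXml("<div someattribute=\"somevalue\" =\"\"=\"\"></div>")); // false
        System.out.println(isWellFormedXml("<div someattribute=\"somevalue\"></div>")); // true
    }
}
```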
jsoup includes a converter to the W3C DOM model that filters out invalid attributes during conversion. You can then run XPath queries on that object directly, which not only works but is also more efficient than serializing to XML and re-parsing it.
See the documentation for org.jsoup.helper.W3CDom.
Here's an example:
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
...
String html = "<div someattribute=\"somevalue\"=\"\"></div>";
org.jsoup.nodes.Document jdoc = Jsoup.parse(html);  // lenient HTML parse
Document w3doc = W3CDom.convert(jdoc);              // W3C DOM; invalid attributes are dropped
String query = "//div";
XPathExpression xpath = XPathFactory.newInstance().newXPath().compile(query);
Node div = (Node) xpath.evaluate(w3doc, XPathConstants.NODE);
System.out.printf("Tag: %s, Attribute: %s",
        div.getNodeName(),
        div.getAttributes().getNamedItem("someattribute"));
(Note that Document and Node here are the W3C DOM types, not the jsoup DOM.)
That gives us:
Tag: div, Attribute: someattribute="somevalue"
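Since the conversion yields a standard org.w3c.dom.Document, the same XPath machinery also handles multi-node results via XPathConstants.NODESET. A self-contained sketch using only the JDK (the XPathNodeSetDemo class and textContents helper are illustrative names, and a DocumentBuilder stands in for the jsoup conversion so the example runs without jsoup on the classpath):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathNodeSetDemo {
    // Evaluates the XPath query and returns the text content of every matching node.
    static String[] textContents(String xml, String query) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .compile(query).evaluate(doc, XPathConstants.NODESET);
        String[] result = new String[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            result[i] = nodes.item(i).getTextContent();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<body><div>first</div><div>second</div></body>";
        for (String text : textContents(xml, "//div")) {
            System.out.println(text); // prints "first" then "second"
        }
    }
}
```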