The input HTML is:
<p>猫<虎</p>
Chrome displays this as 猫<虎
But when I parse the HTML with jsoup, the output HTML is:
<p>猫
<虎 < p>
</虎<></p>
How can I fix this problem without changing the < to &lt; in the input?
Why do you think that jsoup is "wrong" and Chrome is "right"? A < that is not part of a tag should always be escaped as &lt; (because it will otherwise be interpreted as opening a tag). Fix that, and all standards-compliant HTML tools will agree on the same parsing. Do not fix it, and some may disagree. In this case, jsoup is accepting non-alphanumeric characters as part of a tag name, which is invalid. But the input it was given is invalid too: it contains an unescaped < that does not start a tag!
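If the HTML is generated by your own code, the escaping can be done at the point where text is inserted into the markup. The following is only a minimal sketch: the class name EscapeDemo and the helper escapeText are hypothetical names introduced for illustration, and it assumes the text goes into element content (not into an attribute value or a script).

```java
class EscapeDemo {
    // Hypothetical helper: escapes the characters that are significant
    // in HTML text content. Not suitable for attribute values or scripts.
    public static String escapeText(String text) {
        return text.replace("&", "&amp;")   // first, so the entities added below are not escaped again
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // The text from the question, escaped before it is wrapped in a tag.
        String html = "<p>" + escapeText("猫<虎") + "</p>";
        System.out.println(html);   // <p>猫&lt;虎</p>
    }
}
```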
If you insist on not changing the source HTML, you can simply pre-process it before feeding it into jsoup:
// before
Document doc = Jsoup.parse(html);
// with pre-processing
Document doc = Jsoup.parse(fixOutOfTagLessThan(html));
where
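/**
 * Replaces a not-in-tag `<` with `&lt;`, but WILL FAIL in
 * many cases, because it is unaware of:
 * - comments (<!--)
 * - javascript
 * - the fact that you should NOT PARSE HTML WITH REGEX
 */
public static String fixOutOfTagLessThan(String html) {
    // Matches a '<', then text without '<', '/' or '>', then the next '<';
    // only the first '<' is escaped, the second one is kept as-is.
    return html.replaceAll("<([^</>]+)<", "&lt;$1<");
}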
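To make the behaviour and the limits of this pre-processing concrete, here is a small self-contained sketch that runs the same pattern, with &lt; as the replacement, on the input from the question, on well-formed markup, and on an HTML comment. The class name LessThanFixDemo is a made-up name for illustration, and jsoup is deliberately left out so the snippet runs on its own.

```java
class LessThanFixDemo {
    // Same pattern as in the answer: a '<' followed by text that contains
    // no '<', '/' or '>' and then another '<' is treated as a stray '<'.
    public static String fixOutOfTagLessThan(String html) {
        return html.replaceAll("<([^</>]+)<", "&lt;$1<");
    }

    public static void main(String[] args) {
        // Input from the question: the stray '<' is escaped.
        System.out.println(fixOutOfTagLessThan("<p>猫<虎</p>"));      // <p>猫&lt;虎</p>

        // Well-formed markup is left untouched.
        System.out.println(fixOutOfTagLessThan("<p>a</p>"));          // <p>a</p>

        // Known failure: the opening '<' of a comment is escaped by mistake.
        System.out.println(fixOutOfTagLessThan("<!-- a < b -->"));    // &lt;!-- a < b -->
    }
}
```

The last case shows why the warning in the doc comment matters: the pattern has no notion of comments or scripts, so it should only be used on input you know to be simple.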
Chrome appears to be applying HTML5 parse logic to treat the < as text (since it is not part of a valid tag name). However, as I understand it, it should reject everything up to the >, and then issue a missing </p>. So, to my eyes, it does not appear to follow the standard fully either.