
jsoup outputs wrong HTML when < appears inside text


The input HTML is:

<p>猫<虎</p>

Chrome displays this as 猫<虎

But when jsoup parses the HTML, the output is:

<p>猫
  <虎 < p>
  </虎<></p>

How can I fix this problem without changing the < to &lt; in the source HTML?

Solution

  • Why do you think that jsoup is "wrong" and Chrome is "right"? A < that is not part of a tag should always be escaped as &lt;, because otherwise it may be interpreted as the start of a tag. Fix that, and all standards-compliant HTML tools will agree on the same parse; leave it, and some may disagree. In this case, jsoup accepts non-alphanumeric characters in a tag name, which is invalid, but it only did so because it encountered an unescaped < that did not begin a real tag.
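
The escaping advice above can be applied wherever the HTML is generated. The following is a minimal sketch, assuming you control the code that builds the markup; the class and method names (`HtmlTextEscaper`, `escapeText`) are hypothetical and not part of jsoup.

```java
final class HtmlTextEscaper {

    /**
     * Escapes the characters that are significant in HTML text content.
     * Hypothetical helper for illustration; it is meant for text nodes only,
     * not for attribute values.
     */
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;");  break;
                case '>': sb.append("&gt;");  break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: <p>猫&lt;虎</p>
        System.out.println("<p>" + escapeText("猫<虎") + "</p>");
    }
}
```

With the text escaped this way, the markup contains no stray <, so parsers no longer need to guess where a tag begins.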

    If you insist on not changing the source HTML, you can simply pre-process it before feeding it into jsoup:

     // before 
     Document doc = Jsoup.parse(html);
    
     // with pre-processing
     Document doc = Jsoup.parse(fixOutOfTagLessThan(html));
    

    where

     /**
      * Replaces not-in-tag `<` by `&lt;`, but WILL FAIL in 
      * many cases, because it is unaware of:
      * - comments (<!--)
      * - javascript
      * - the fact that you should NOT PARSE HTML WITH REGEX
      */
     public static String fixOutOfTagLessThan(String html) {
        return html.replaceAll("<([^</>]+)<", "&lt;$1<");
     }
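
To check the pre-processing idea without jsoup on the classpath, here is a small self-contained sketch. It is a variant of the helper above, not the original: it assumes a lookahead `(?=<)` is acceptable, so that the trailing < is not consumed and several stray < in a row are all escaped. The class name `LessThanFixer` is hypothetical, and the same caveats apply (comments, scripts, and regex-on-HTML in general).

```java
import java.util.regex.Pattern;

final class LessThanFixer {

    // A '<', then one or more characters that are none of '<', '/', '>',
    // and then (lookahead, not consumed) another '<'.
    private static final Pattern STRAY_LT = Pattern.compile("<([^</>]+)(?=<)");

    /** Escapes a '<' that is followed by another '<' before any '>' or '/'. */
    static String fixOutOfTagLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;$1");
    }

    public static void main(String[] args) {
        // Prints: <p>猫&lt;虎</p>
        System.out.println(fixOutOfTagLessThan("<p>猫<虎</p>"));
        // Prints: <p>1&lt;2&lt;3</p>
        System.out.println(fixOutOfTagLessThan("<p>1<2<3</p>"));
    }
}
```

The result of `fixOutOfTagLessThan` can then be passed to `Jsoup.parse` as shown earlier. Well-formed input such as `<p>ok</p>` passes through unchanged.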
    

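    Chrome appears to apply the HTML5 parsing rules and treats the < as text, since 虎 cannot start a tag name. As I understand the HTML5 tokenizer, a < followed by anything other than an ASCII letter, /, ! or ? is a parse error and the < is emitted as a literal character, so Chrome's rendering matches the standard's error recovery. The input is still invalid HTML, though, and parsers that do not implement that recovery may read it differently.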