java html jsoup xss htmlcleaner

Using jsoup to escape disallowed tags


I am evaluating jsoup for a way to escape (but not remove!) non-whitelisted tags. Let's say only the <b> tag is allowed, so the following input

foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>

has to yield the following:

foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;

I see the following problems/questions with jsoup:

  • document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
  • how do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
  • Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?

Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.


Solution

  • Answer 1

    How do you load / parse your Document with jsoup? If you use parse() or connect().get(), jsoup automatically normalizes your HTML (inserting html, head and body tags). This ensures you always have a complete HTML document, even if the input isn't complete.

    If you only want to clean the input (with no further processing), you should use clean() instead of the methods listed above.

    Example 1 - Using parse()

    final String html = "<b>a</b>";
    
    System.out.println(Jsoup.parse(html));
    

    Output:

    <html>
     <head></head>
     <body>
      <b>a</b>
     </body>
    </html>
    

    The input HTML is wrapped in html, head and body elements so that the result is a complete document.

    Example 2 - Using clean()

    final String html = "<b>a</b>";
    
    // Whitelist was renamed to Safelist in jsoup 1.14.1; use Safelist.relaxed() on newer versions.
    System.out.println(Jsoup.clean(html, Whitelist.relaxed()));
    

    Output:

    <b>a</b>
    

    The input HTML is cleaned and nothing more; no document structure is added.



    Answer 2

    The method replaceWith() does exactly what you need:

    Example:

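    final String html = "<b><script>your script here</script></b>";
    Document doc = Jsoup.parse(html);
    
    // Replace each <script> element with a text node holding its own markup;
    // jsoup escapes the angle brackets when the text node is serialized.
    // Newer jsoup releases deprecate the two-argument overload; there, use createFromEncoded(String).
    for( Element element : doc.select("script") )
    {
        element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
    }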
    
    System.out.println(doc);
    

    Output:

    <html>
     <head></head>
     <body>
      <b>&lt;script&gt;your script here&lt;/script&gt;</b>
     </body>
    </html>
    

    Or body only:

    System.out.println(doc.body().html());
    

    Output:

    <b>&lt;script&gt;your script here&lt;/script&gt;</b>
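
To make the target transformation from the question concrete, here is a minimal plain-Java sketch that escapes only the angle brackets of tags whose names are not in an allowed set, leaving attributes untouched. This is not jsoup's API: it swaps in a regular expression for a real parser, and the names TagEscaper, escapeDisallowed and allowedTags are hypothetical. It does not filter attributes on allowed tags and will misread attribute values that contain '<' or '>', so treat it as an illustration of the expected output rather than an XSS defence.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: a regex-based sketch only.
public class TagEscaper {
    // Matches an opening or closing tag and captures its name,
    // e.g. "<b>", "</script>" or "<script onLoad='x'>".
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^<>]*>");

    /** Escapes the angle brackets of every tag whose name is not in allowedTags. */
    public static String escapeDisallowed(String html, Set<String> allowedTags) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(html, last, m.start());
            String tag = m.group();
            if (allowedTags.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                out.append(tag); // allowed tag: keep as-is
            } else {
                // Disallowed tag: swap only the outer brackets for entities.
                out.append("&lt;").append(tag, 1, tag.length() - 1).append("&gt;");
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeDisallowed(input, Collections.singleton("b")));
    }
}
```

Run on the input from the question with only b allowed, this prints the string the question asks for. The input is returned in the same shape it came in, since no html, head or body wrapper is ever added.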
    



    Answer 3

    Yes, the prettyPrint() method of Document.OutputSettings does this.

    Example:

    final String html = "<p>your html here</p>";
    
    Document doc = Jsoup.parse(html);
    doc.outputSettings().prettyPrint(false);
    
    System.out.println(doc);
    

    Note: if the outputSettings() method is not available, please update Jsoup.

    Output:

    <html><head></head><body><p>your html here</p></body></html>
    



    Answer 4 (the question outside the bullets)

    No! jsoup is one of the best and most capable HTML libraries out there!