
Java Safelist Add <head> Tag to Allowed List


I want to build a safelist that strips every HTML tag from my data except head, body, and i. To do that, I used the Safelist class from the jsoup library.

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

Safelist safe_list = Safelist.none();
safe_list.addTags("head", "body", "i");
// Java string literals cannot span lines, so the markup is concatenated
String data = "<head>Title here</head>\n"
            + "<body>\n"
            + "<p><b> paragraph 1</b></p>\n"
            + "<p><i> paragraph 2</i></p>\n"
            + "</body>";
String cleaned_data = Jsoup.clean(data, safe_list);
System.out.println(cleaned_data);

The expected result was:

<head>
 Title here
</head>
<body>
 paragraph 1 <i>paragraph 2</i>
</body>

but the result I got was:

<body>
 Title here paragraph 1 <i>paragraph 2</i>
</body>

Although the head tag is in the allowed list, it is removed from the data, unlike the body and i tags. What is the problem with the head tag, and what should I do to keep it in the data?


Solution

  • I found a solution. It may not be the exact solution, but it works in my case. The official jsoup website has the following information:

    The cleaner and these safelists assume that you want to clean a body fragment of HTML (to add user supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the document HTML around the cleaned body HTML, or create a safelist that allows html and head elements as appropriate.

    Creating a safelist that allows the html and head elements did not work for me, so I took the first suggestion and wrapped the head around the cleaned body:

    Safelist safe_list = Safelist.none();
    safe_list.addTags("body", "i");
    // Clean only the body; the head is added back afterwards
    String data = "<body>\n"
                + "<p><b> paragraph 1</b></p>\n"
                + "<p><i> paragraph 2</i></p>\n"
                + "</body>";
    String cleaned_data = Jsoup.clean(data, safe_list);
    // Java strings need double quotes, and the statement needs a semicolon
    cleaned_data = "<head>Title here</head>" + cleaned_data;
    System.out.println(cleaned_data);
    

    https://jsoup.org/apidocs/org/jsoup/safety/Safelist.html
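
The "wrap the document HTML around the cleaned body HTML" suggestion can be sketched as a small reusable helper. This is only an illustration under some assumptions: the class name DocumentWrapper and the method cleanAndWrap are hypothetical, the head markup is assumed to be trusted (it is never passed through the cleaner), and the cleaner is supplied as a function so the sketch runs without jsoup on the classpath. Unlike the snippet above, the helper adds the body tags itself, so the safelist would only need to allow i.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of jsoup.
class DocumentWrapper {

    // Cleans only the untrusted body fragment, then wraps the trusted
    // head markup and the body tags around the cleaned result.
    static String cleanAndWrap(String trustedHead, String untrustedBody,
                               UnaryOperator<String> bodyCleaner) {
        String cleanedBody = bodyCleaner.apply(untrustedBody);
        return "<head>" + trustedHead + "</head>\n"
             + "<body>\n" + cleanedBody + "\n</body>";
    }

    public static void main(String[] args) {
        // Stand-in cleaner for demonstration only (NOT a safe sanitizer).
        // With jsoup available, one would likely pass instead:
        //   html -> Jsoup.clean(html, Safelist.none().addTags("i"))
        UnaryOperator<String> standInCleaner =
                html -> html.replaceAll("</?(p|b)>", "");

        String body = "<p><b> paragraph 1</b></p><p><i> paragraph 2</i></p>";
        System.out.println(cleanAndWrap("Title here", body, standInCleaner));
    }
}
```

Keeping the head out of the cleaner means it must come from a source you control; if the head text is user supplied, it would need its own escaping or cleaning step.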