I want to create a whitelist that removes all HTML tags from some data except head, body and i. To do that I used the Safelist class from the jsoup library.
Safelist safe_list = Safelist.none();
safe_list.addTags(new String[] { "head", "body", "i"});
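// A Java string literal cannot span several lines, so the pieces are concatenated.
String data = "<head>Title here</head>\n"
        + "<body>\n"
        + "<p><b> paragraph 1</b></p>\n"
        + "<p><i> paragraph 2</i></p>\n"
        + "</body>";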
String cleaned_data = Jsoup.clean(data,safe_list);
System.out.println(cleaned_data);
The expected result was:
<head>
Title here
</head>
<body>
paragraph 1 <i>paragraph 2</i>
</body>
but the result I got was:
<body>
Title here paragraph 1 <i>paragraph 2</i>
</body>
Although the head tag is in the allowed list, it is removed from the data, unlike the body and i tags. What is the problem with the head tag, and what should I do to keep it in the data?
I found a solution. It may not be the exact solution, but it works in my case. The official jsoup website has the following information:
The cleaner and these safelists assume that you want to clean a body fragment of HTML (to add user supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the document HTML around the cleaned body HTML, or create a safelist that allows html and head elements as appropriate.
Because creating a safelist that allows html and head elements did not work for me, I took the first suggestion and wrapped the document HTML around the cleaned body HTML:
Safelist safe_list = Safelist.none();
safe_list.addTags(new String[] {"body", "i"});
// Only the body fragment is cleaned; the head is added back afterwards.
String data = "<body>\n"
        + "<p><b> paragraph 1</b></p>\n"
        + "<p><i> paragraph 2</i></p>\n"
        + "</body>";
String cleaned_data = Jsoup.clean(data,safe_list);
cleaned_data = "<head>Title here</head>" + cleaned_data;
System.out.println(cleaned_data);
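If the input is a full document rather than a hand-split string, the same wrapping idea can be sketched roughly as below. This is only an illustration under a few assumptions: the title lives in a proper `<title>` element (unlike the bare text in my `<head>` above), only the title is worth carrying over from the head, and the class and method names (`FullDocumentCleaner`, `cleanKeepingTitle`) are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class FullDocumentCleaner {

    // Hypothetical helper: cleans only the body of a full HTML document,
    // then rebuilds an empty document shell around the cleaned fragment.
    public static String cleanKeepingTitle(String html, Safelist safelist) {
        // Parse the input as a full document so head and body are separated.
        Document dirty = Jsoup.parse(html);

        // Jsoup.clean works on body fragments, so pass it the body's inner HTML only.
        String cleanedBody = Jsoup.clean(dirty.body().html(), safelist);

        // Build a fresh html/head/body skeleton and fill it in.
        Document rebuilt = Document.createShell("");
        rebuilt.title(dirty.title());     // the title is copied as plain text
        rebuilt.body().html(cleanedBody); // the body receives the cleaned markup
        return rebuilt.outerHtml();
    }

    public static void main(String[] args) {
        String data = "<html><head><title>Title here</title></head>"
                + "<body>"
                + "<p><b> paragraph 1</b></p>"
                + "<p><i> paragraph 2</i></p>"
                + "</body></html>";

        // Only <i> has to be allowed; head and body come from the rebuilt shell.
        Safelist safelist = Safelist.none().addTags("i");
        System.out.println(cleanKeepingTitle(data, safelist));
    }
}
```

Copying the title as text instead of copying the whole head means nothing from the untrusted head ends up in the output without being cleaned. If more of the head is needed, it would have to be handled separately.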