I want to keep only the HTML tags, along with their attributes, and remove the text content.
Input String:
String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> <b> example <a><p></b>this is the </a> link </p>";
Expected output:
<p><br></br><b></b><a href="http://example.com/" target="h"><b><a><p></p></a></b></a></p>
Edit: Most of the questions on Google or Stack Overflow are only about removing the HTML and extracting the text. I spent around 3 hours before arriving at the solutions below, so I am posting them here in case they help others.
Hope this helps someone like me looking to remove only the text content from the HTML string.
Output of the first solution:
<p><br></br><b></b><a href="http://example.com/" target="h"><b><a><p></p></a></b></a></p>
String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> <b> example <a><p></b>this is the </a> link </p>";
Traverser traverser = new Traverser();
// The XML parser keeps the input as-is. The HTML parser works as well, but it adds the <html>, <head> and <body> tags.
Document document = Jsoup.parse(html, "", Parser.xmlParser());
document.traverse(traverser);
System.out.println(traverser.extractHtmlBuilder.toString());
Appending node.attributes() after the tag name includes all of the element's attributes in the output.
public static class Traverser implements NodeVisitor {
    // Accumulates the tags only; text nodes are never appended.
    StringBuilder extractHtmlBuilder = new StringBuilder();

    // Called when a node is first reached: emit the opening tag with its attributes.
    @Override
    public void head(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            extractHtmlBuilder.append("<").append(node.nodeName()).append(node.attributes()).append(">");
        }
    }

    // Called after all of the node's children were visited: emit the closing tag.
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            extractHtmlBuilder.append("</").append(node.nodeName()).append(">");
        }
    }
}
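For anyone who wants to paste and run this, here is a minimal sketch that puts the calling code and the visitor into one file. The wrapper class TagSkeleton and the method extract are hypothetical names I made up for illustration; the visitor logic is the same as above, and it assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.select.NodeVisitor;

// Hypothetical wrapper class; only the Traverser logic comes from the answer.
class TagSkeleton {

    // Emits opening and closing tags (with attributes) and ignores text nodes.
    static class Traverser implements NodeVisitor {
        final StringBuilder extractHtmlBuilder = new StringBuilder();

        @Override
        public void head(Node node, int depth) {
            if (node instanceof Element && !(node instanceof Document)) {
                extractHtmlBuilder.append('<').append(node.nodeName())
                        .append(node.attributes()).append('>');
            }
        }

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof Element && !(node instanceof Document)) {
                extractHtmlBuilder.append("</").append(node.nodeName()).append('>');
            }
        }
    }

    // Parses with the XML parser so no <html>/<head>/<body> wrapper is added.
    static String extract(String html) {
        Document document = Jsoup.parse(html, "", Parser.xmlParser());
        Traverser traverser = new Traverser();
        document.traverse(traverser);
        return traverser.extractHtmlBuilder.toString();
    }

    public static void main(String[] args) {
        String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> "
                + "<b> example <a><p></b>this is the </a> link </p>";
        System.out.println(extract(html));
    }
}
```

Returning a String from extract instead of printing inside the visitor makes it easier to reuse the result or check it in a unit test.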
Another Solution:
Document document = Jsoup.parse(html, "", Parser.xmlParser());
for (Element element : document.select("*")) {
    if (!element.ownText().isEmpty()) {
        for (TextNode node : element.textNodes()) {
            node.remove();
        }
    }
}
System.out.println(document.toString());
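A possible variation on the second solution, as a sketch only: as far as I can tell, the ownText() check leaves whitespace-only text nodes in place, and the default pretty-printing adds line breaks to the result. The version below drops the check and turns pretty-printing off to get a compact string. TextNodeRemover and removeText are hypothetical names, and Jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

// Hypothetical helper class, not part of Jsoup.
class TextNodeRemover {

    static String removeText(String html) {
        Document document = Jsoup.parse(html, "", Parser.xmlParser());
        // Compact output: no indentation or line breaks added by the serializer.
        document.outputSettings().prettyPrint(false);

        for (Element element : document.select("*")) {
            // textNodes() returns a copy of the child text nodes,
            // so removing nodes while iterating over it is safe.
            for (TextNode node : element.textNodes()) {
                node.remove();
            }
        }
        return document.html();
    }

    public static void main(String[] args) {
        System.out.println(removeText("<p>Hi <b>there</b></p>"));
    }
}
```

Unlike the visitor approach, this one lets Jsoup serialize the tree, so how empty or self-closing elements such as br are written depends on Jsoup's output settings.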