Extract only HTML tags and attributes from an HTML string using Jsoup


I want to keep only the HTML tags together with their attributes, and remove all of the text content.

Input String:

String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> <b> example <a><p></b>this is  the </a> link </p>";

Output

<p><br></br><b></b><a href="http://example.com/" target="h"><b><a><p></p></a></b></a></p>

Edit: Most of the questions on Google or Stack Overflow are about the opposite problem: removing the HTML and extracting only the text. It took me around 3 hours to arrive at the solutions below, so I am posting them here in case they help others.


Solution

  • I hope this helps anyone else who needs to remove only the text content from an HTML string.

    Output

    <p><br></br><b></b><a href="http://example.com/" target="h"><b><a><p></p></a></b></a></p>
    
    String html = "<p>An <br/><b></b> <a href='http://example.com/' target=\"h\"> <b> example <a><p></b>this is  the </a> link </p>";
    Traverser traverser = new Traverser();

    // The XML parser does not add any wrapper elements. The HTML parser also
    // works, but it wraps the result in <html>, <head> and <body> tags.
    Document document = Jsoup.parse(html, "", Parser.xmlParser());

    // Walk every node; the visitor collects tags and attributes only.
    document.traverse(traverser);
    System.out.println(traverser.extractHtmlBuilder.toString());
    

    Appending node.attributes() to the opening tag includes all of the element's attributes.

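        // Needs org.jsoup.nodes.Document, org.jsoup.nodes.Element,
        // org.jsoup.nodes.Node and org.jsoup.select.NodeVisitor.
        public static class Traverser implements NodeVisitor {

            StringBuilder extractHtmlBuilder = new StringBuilder();

            // Called when a node is first reached: write the opening tag.
            @Override
            public void head(Node node, int depth) {
                // Skip text nodes, comments and the Document root itself.
                if (node instanceof Element && !(node instanceof Document)) {
                    extractHtmlBuilder.append("<").append(node.nodeName()).append(node.attributes()).append(">");
                }
            }

            // Called after all of the node's children were visited: write the closing tag.
            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && !(node instanceof Document)) {
                    extractHtmlBuilder.append("</").append(node.nodeName()).append(">");
                }
            }
        }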
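
For anyone who wants to paste and run this, here is a minimal self-contained sketch of the visitor approach above. The class name `TagSkeleton` and the method `stripText` are made up for illustration, and it assumes jsoup is on the classpath. The parser is passed in as an argument so the difference mentioned earlier is visible: the XML parser adds nothing, while the HTML parser wraps the result in html, head and body tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.select.NodeVisitor;

// Hypothetical wrapper class; the name is not from the original answer.
public class TagSkeleton {

    /** Returns the tags and attributes of the input, with all text dropped. */
    public static String stripText(String html, Parser parser) {
        Document document = Jsoup.parse(html, "", parser);
        StringBuilder out = new StringBuilder();

        document.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Only elements produce output; text nodes are skipped.
                if (node instanceof Element && !(node instanceof Document)) {
                    out.append('<').append(node.nodeName()).append(node.attributes()).append('>');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && !(node instanceof Document)) {
                    out.append("</").append(node.nodeName()).append('>');
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div id='x'>Hello <span class=\"y\">world</span>!</div>";

        // prints <div id="x"><span class="y"></span></div>
        System.out.println(stripText(html, Parser.xmlParser()));

        // prints <html><head></head><body><div id="x"><span class="y"></span></div></body></html>
        System.out.println(stripText(html, Parser.htmlParser()));
    }
}
```

Because the markup is rebuilt by hand, every element gets an explicit closing tag, which is why `<br/>` shows up as `<br></br>` in the output above.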
    
    

    Another Solution:

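    Document document = Jsoup.parse(html, "", Parser.xmlParser());
    for (Element element : document.select("*")) {
        // No emptiness check is needed: textNodes() is simply empty for an
        // element without text. It returns a copy of the list, so removing
        // nodes inside the loop is safe.
        for (TextNode node : element.textNodes()) {
            node.remove();
        }
    }
    System.out.println(document.toString());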
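
If the wrapper tags are the only thing keeping you from the HTML parser, one possible variation of this second approach is to serialize just the body. This is a sketch and not part of the original answer: `TextNodeRemover` and `removeText` are hypothetical names, it swaps in the default HTML parser (`Jsoup.parse(html)`) for the XML parser, and it assumes jsoup is on the classpath. Keep in mind that the HTML parser repairs malformed markup by HTML rules, so for broken input the nesting may differ from what the XML parser produces.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class; the name is not from the original answer.
public class TextNodeRemover {

    /** Parses the input as HTML, deletes every text node, and returns the body markup. */
    public static String removeText(String html) {
        Document document = Jsoup.parse(html);

        // Ask for compact output explicitly, so the result does not depend
        // on the parser's default formatting settings.
        document.outputSettings().prettyPrint(false);

        for (Element element : document.select("*")) {
            // textNodes() returns a copy, so removal during iteration is safe.
            for (TextNode node : element.textNodes()) {
                node.remove();
            }
        }

        // body().html() is the inner markup of <body>, without the wrappers.
        return document.body().html();
    }

    public static void main(String[] args) {
        System.out.println(removeText("<div id='x'>Hello <span class=\"y\">world</span>!</div>"));
    }
}
```

Unlike the visitor version, this leaves the serialization to jsoup, so void elements such as `br` are written the way jsoup normally writes them rather than as an open and close pair.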