java html-parsing jsoup

Remove empty tag pairs from an HTML fragment


I have a user-submitted string that contains HTML content such as

"<p></p><div></div><p>Hello<br/>world</p><p></p>"

I would like to transform this string so that empty tag pairs are removed (but self-closing tags like <br/> are retained). For example, the transformation should turn the string above into

"<p>Hello<br/>world</p>"

I'd like to use JSoup for this, as I already have it on my classpath, and it would be easiest for me to perform the transformation on the server side.


Solution

  • Here is an example that does just that (using JSoup):

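    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    String html = "<p></p><div></div><p>Hello<br/>world</p><p></p>";
    Document doc = Jsoup.parse(html);

    // Remove block-level elements with no text; inline tags such as <br> are kept.
    for (Element element : doc.select("*")) {
        if (!element.hasText() && element.isBlock()) {
            element.remove();
        }
    }

    // Print only the body content, without the generated <html>/<body> wrapper.
    System.out.println(doc.body().html());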
    

    The output of the code above is what you are looking for (depending on the JSoup version, the line break may be printed as <br> rather than <br />):

    <p>Hello<br />world</p>
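
The loop above can also be wrapped in a reusable method. The following is only a sketch under a few assumptions: the class name EmptyTagCleaner and the method removeEmptyBlocks are made up for illustration and are not part of JSoup, and treating an element that contains an <img> as non-empty is an extra rule that the question did not ask for. It is included because hasText() alone would also drop a <div> that holds nothing but an image. The sketch uses Jsoup.parseBodyFragment, which is intended for fragments, and skips the generated <body> so that it is never removed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the name and method are illustrative, not part of JSoup.
class EmptyTagCleaner {

    // Returns the fragment with text-less block elements removed.
    static String removeEmptyBlocks(String htmlFragment) {
        // parseBodyFragment places the fragment inside a generated <body>.
        Document doc = Jsoup.parseBodyFragment(htmlFragment);
        Element body = doc.body();

        // getAllElements() includes the body itself, so skip it explicitly.
        for (Element element : body.getAllElements()) {
            if (element == body) {
                continue;
            }
            // Assumption: an element that contains an <img> counts as non-empty.
            boolean hasImage = !element.select("img").isEmpty();
            if (element.isBlock() && !element.hasText() && !hasImage) {
                element.remove();
            }
        }
        return body.html();
    }

    public static void main(String[] args) {
        String html = "<p></p><div></div><p>Hello<br/>world</p><p></p>";
        System.out.println(removeEmptyBlocks(html));
    }
}
```

Because hasText() ignores whitespace-only text, a pair such as <p> </p> is removed as well. If other text-less content should survive (for example <hr> or embedded media), the selector passed to select() is the place to extend the rule.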