Removing text enclosed between HTML tags using JSoup


In some cases of HTML cleaning, I would like to retain the text enclosed between the tags (which is the default behaviour of Jsoup), and in other cases I would like to remove the text along with the HTML tags. How can I remove the text enclosed between HTML tags using Jsoup?


Solution

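  • The Cleaner will always drop tags and preserve text. If you need to drop elements (i.e. tags and text / nested elements), you can pre-parse the HTML, remove the elements using either remove() or empty(), then run the resulting HTML through the cleaner. remove() deletes the matched element itself, while empty() deletes only its children and leaves the empty tag in place, which the cleaner then drops if the tag is not allowed.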

    For example:

    String html = "Clean <div>Text dropped</div>";
    Document doc = Jsoup.parse(html);
    doc.select("div").remove();
    
    // if not removed, the cleaner will drop the <div> but leave the inner text
    String clean = Jsoup.clean(doc.body().html(), Whitelist.basic());
    

    If you are using JSoup 1.14.1 or later, use Safelist instead of Whitelist, as Whitelist has been deprecated and will be removed in 1.15.1.

    String clean = Jsoup.clean(doc.body().html(), Safelist.basic());
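
The two steps above can be wrapped in a small helper. This is only a sketch: the class name DropElements and the method cleanDropping are made up for illustration and are not part of Jsoup, and it assumes Jsoup 1.14.1 or later so that Safelist is available. The `<font>` tag in the sample input is used simply because it is not in the basic safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class DropElements {
    // Hypothetical helper (not a Jsoup API): removes every element matching
    // cssQuery, including its text and nested elements, then runs the
    // remaining HTML through the cleaner with the basic safelist.
    static String cleanDropping(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.select(cssQuery).remove();
        return Jsoup.clean(doc.body().html(), Safelist.basic());
    }

    public static void main(String[] args) {
        // The <div> and its text are removed before cleaning; the <font> tag
        // is stripped by the cleaner, but its inner text survives.
        String html = "Clean <div>Text dropped</div><font>Text kept</font>";
        System.out.println(cleanDropping(html, "div"));
    }
}
```

Passing a different selector (for example "script, style") lets you choose, per call, which elements lose their text and which only lose their tags.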