Remove nodes that don't contain their own text using Jsoup


I notice that a lot of web pages have HTML nodes that are superfluous for my purposes. I'd like to remove them from the page, as it would make my processing a lot easier.

Is there a way to do this with Jsoup?

To make the situation clearer, let's say we have the following page:

<html>
  <head>
  </head>
  <body>
    <div>I have some text</div>
    <div class='useless'>
      <div class='useless'>
        <div>I also have text
          <div>I also have text</div>
        </div>
      </div>   
    </div>      
  </body>
</html>

I'd like to remove the class='useless' divs. Of course, I can't select them by class, id, tag and so on, only by the fact that they have no text content of their own. This will change the structure of the page, which is totally fine; it will make my final processing easier.

Is this possible, whether in an easy or a hard way?

The result would be:

<html>
  <head>
  </head>
  <body>
    <div>I have some text</div>
    <div>I also have text
      <div>I also have text</div>
    </div>  
  </body>
</html>

Right now I can't think of anything particularly elegant. My inclination is to call ownText() on each element (checking ownText().length() > 0) and remove the element when the check fails, but I think that would also remove its child elements, even those that pass the ownText() check themselves.
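
The concern above can be checked with a small sketch. It assumes jsoup is on the classpath; the class name OwnTextDemo and the helper hasNoOwnText are made up for illustration. It shows that ownText() ignores text in descendants, while remove() detaches the element together with its whole subtree.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Hypothetical helper: true when the element has no text of its own
    // (text inside child elements does not count).
    static boolean hasNoOwnText(Element e) {
        return e.ownText().isEmpty();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div class='useless'><div>I also have text</div></div>");
        Element outer = doc.selectFirst("div.useless");

        // ownText() only looks at the element's direct text nodes
        System.out.println(hasNoOwnText(outer));
        // text() also includes the text of all descendants
        System.out.println(outer.text());

        // remove() detaches the element and everything below it,
        // so the inner div with text is gone as well
        outer.remove();
        System.out.println(doc.select("div").size());
    }
}
```

So a plain remove() is not enough here: the children have to be moved to the parent before the empty wrapper is dropped.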


Solution

  • You can use Document.getAllElements() and check whether each element has ownText(). If it does, leave it alone. If it does not, append all of its children to its parent node (if there is one) and remove the element itself. This should do the job:

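    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    Document document = Jsoup.parse(html);
    document.getAllElements().stream()
            .filter(e -> e.ownText().isEmpty())   // no text of its own
            .filter(Element::hasParent)           // skips the Document root
            .forEach(e -> {
                // append the child elements to the parent, then drop the emptied element
                e.children().forEach(e.parent()::appendChild);
                e.remove();
            });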
    

    For the HTML you shared, the result will be:

    <div>
     I have some text
    </div>
    <div>
     I also have text 
     <div>
      I also have text
     </div> 
    </div>
    

    As I mentioned in the comments, under your ownText() rule the html, head and body elements are removed as well, since they have no text of their own.


    If you want to prevent certain tags from being removed, you can use a simple Set or List containing the tag names that should be retained:

    // "head" is listed too; without it the empty head element would also be removed
    Set<String> retainTagNames = new HashSet<>(Arrays.asList("html", "head", "body"));
    Document document = Jsoup.parse(html);
    document.getAllElements().stream()
            .filter(e -> !retainTagNames.contains(e.tagName()))
            .filter(e -> e.ownText().isEmpty())
            .filter(Element::hasParent)
            .forEach(e -> {
                e.children().forEach(e.parent()::appendChild);
                e.remove();
            });
    

    The result of this will be:

    <html>
     <head> 
     </head> 
     <body> 
      <div>
       I have some text
      </div>   
      <div>
       I also have text 
       <div>
        I also have text
       </div> 
      </div>
     </body>
    </html>
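
One caveat with appendChild: the children are added at the end of the parent, so a wrapper that sits before a sibling with text would have its content moved after that sibling. The following is a minimal sketch of a variant that uses jsoup's Node.unwrap(), which drops a node and moves its children up into the same position. The class name UnwrapEmptyElements, the method prune and the contents of RETAIN are assumptions for illustration, not part of the answer above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UnwrapEmptyElements {
    // Assumed set of tags to keep even though they have no own text
    private static final Set<String> RETAIN =
            new HashSet<>(Arrays.asList("html", "head", "body"));

    // Hypothetical helper: parses the HTML and unwraps every element
    // that has no text of its own, keeping its children in place.
    static Document prune(String html) {
        Document document = Jsoup.parse(html);
        // getAllElements() returns a freshly collected list, so the tree
        // can be modified while iterating over it
        for (Element e : document.getAllElements()) {
            if (RETAIN.contains(e.tagName())) continue;
            if (!e.ownText().isEmpty()) continue;
            if (!e.hasParent()) continue; // the Document root has no parent
            // unwrap() removes e and moves its child nodes up to e's position
            e.unwrap();
        }
        return document;
    }

    public static void main(String[] args) {
        String html = "<body>"
                + "<div class='useless'><div>I also have text</div></div>"
                + "<div>I have some text</div>"
                + "</body>";
        System.out.println(prune(html).body().html());
    }
}
```

Elements such as img or br also have no own text, so under this rule they are dropped too; adding their tag names to the retain set keeps them.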