I notice a lot of web pages have superfluous (for my purposes) HTML nodes. I'd like to remove them from the page, as it would make my processing a lot easier.
Is there a way to do it with JSoup?
To make the situation clearer, let's say we have the following page:
<html>
  <head>
  </head>
  <body>
    <div>I have some text</div>
    <div class='useless'>
      <div class='useless'>
        <div>I also have text
          <div>I also have text</div>
        </div>
      </div>
    </div>
  </body>
</html>
I'd like to remove the class='useless' divs, but of course I can't select them by their class, id, tag, etc., only by the fact that they have no text content of their own. This will of course change the structure of the page, and that's totally fine; it will make my final processing easier.
Is this possible, in either an easy or a hard way?
The result would be:
<html>
  <head>
  </head>
  <body>
    <div>I have some text</div>
    <div>I also have text
      <div>I also have text</div>
    </div>
  </body>
</html>
Right now I can't think of anything particularly elegant. My general inclination is to check the ownText() method on the various elements (that is, check ownText().length() > 0) and, if that is false, try to remove them, but I think that will remove any sub/child elements as well, even if they match true for an .ownText() condition.
You can use Document.getAllElements() and check whether each element has any ownText(). If it does, do nothing. If not, append all of its children to the parent node, if there is one, and then remove the element. This should do the job:
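import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document document = Jsoup.parse(html);
document.getAllElements().stream()
        .filter(e -> e.ownText().isEmpty())   // no text of its own
        .filter(Element::hasParent)           // skips the Document root itself
        .forEach(e -> {
            // appendChild moves each child to the END of the parent's child list
            e.children().forEach(e.parent()::appendChild);
            e.remove();
        });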
For the HTML you shared, the result will be this:
<div>
I have some text
</div>
<div>
I also have text
<div>
I also have text
</div>
</div>
As I mentioned in the comments, with your ownText() rule the html, head and body elements are also removed.
If you want to prevent certain tags from being removed, you can use a simple Set or List containing the tag names that should be retained:
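import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Tags that are kept even if they have no text of their own.
// "head" is listed too, since the result below still contains it.
Set<String> retainTagNames = new HashSet<>(Arrays.asList("html", "head", "body"));
Document document = Jsoup.parse(html);
document.getAllElements().stream()
        .filter(e -> !retainTagNames.contains(e.tagName()))
        .filter(e -> e.ownText().isEmpty())
        .filter(Element::hasParent)
        .forEach(e -> {
            e.children().forEach(e.parent()::appendChild);
            e.remove();
        });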
The result of this will be:
<html>
<head>
</head>
<body>
<div>
I have some text
</div>
<div>
I also have text
<div>
I also have text
</div>
</div>
</body>
</html>
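One caveat with the approach above: appendChild puts the moved children at the end of the parent, so if a text-less wrapper sits between other siblings, the order of the page can change. Below is a rough sketch of the same rule that moves the children in place instead. To keep it runnable without any dependency it does not use jsoup at all; it swaps in the JDK's built-in org.w3c.dom API and therefore assumes the input is well-formed XHTML, which real web pages often are not. The class name FlattenSketch, its helper methods and the retained-tag set are made up for this illustration. With jsoup itself, Node.unwrap() should give comparable in-place behaviour, though that is not what is shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of jsoup or of the answer above.
class FlattenSketch {
    // Assumption: these tags are kept even when they have no text of their own.
    static final Set<String> RETAIN = Set.of("html", "head", "body");

    /** True if the element has a direct text-node child with non-blank text. */
    static boolean hasOwnText(Element e) {
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE && !n.getTextContent().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /** Parses well-formed XHTML, unwraps text-less elements, returns the markup. */
    static String flatten(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // getElementsByTagName returns a live list, so take a snapshot first.
            NodeList live = doc.getElementsByTagName("*");
            List<Element> all = new ArrayList<>();
            for (int i = 0; i < live.getLength(); i++) {
                all.add((Element) live.item(i));
            }

            for (Element e : all) {
                Node parent = e.getParentNode();
                // Never unwrap the root element: its parent is the Document node.
                if (parent == null || parent.getNodeType() != Node.ELEMENT_NODE) continue;
                if (RETAIN.contains(e.getTagName()) || hasOwnText(e)) continue;
                // Move every child directly in front of e, preserving order.
                while (e.getFirstChild() != null) {
                    parent.insertBefore(e.getFirstChild(), e);
                }
                parent.removeChild(e);
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse or serialize input", ex);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head></head><body><div>I have some text</div>"
                + "<div class='useless'><div class='useless'><div>I also have text"
                + "<div>I also have text</div></div></div></div></body></html>";
        System.out.println(flatten(page));
    }
}
```

Running main on the page from the question should leave only the three divs that carry text, in their original order. The snapshot list matters: removing nodes while walking the live NodeList would skip elements.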