I'm using Jsoup to extract links from a web page, but I want to avoid image links. The following code:
Document doc = Jsoup.connect(i_Url).userAgent("chrome/5.0").get();
Elements links = doc.select("a[href]");
will get me all the links, but some of them are image links. Doing the following:
links.stream().filter(link -> !link.tagName().equals("img"));
won't work, because the img tag belongs to the link element's child, not to the link itself. For instance:
<a href="index.htm" title="tutorialspoint">
<img alt="tutorialspoint" src="/java/images/logo.png">
</a>
I tried all sorts of things, such as:
Elements links = doc.select("a[href]").select(":not(img)"); //or
Elements links = doc.select("a[href]:not(img)"); //or
Elements links = doc.select("a[href]");
links.stream().filter(link -> link.children().contains(Tag.valueOf("img")));
I tried all kinds of variations, but none of them worked. I'm not an expert when it comes to HTML. Any help would be appreciated. Thanks.
Use the following selector:
a[href]:not(:has(img))
I have just tested it with the following unit test, and it works like a charm:
@Test
public void testParsingLinksWithoutImagesInside() {
    //given:
    String html = "<a href=\"index.htm\" title=\"tutorialspoint\">\n" +
            " <img alt=\"tutorialspoint\" src=\"/java/images/logo.png\">\n" +
            "</a>";
    //when:
    Document document = Jsoup.parse(html);
    Elements elements = document.select("a[href]:not(:has(img))");
    //then:
    assertThat(elements.size()).isEqualTo(0);
}
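The test above only covers a link that wraps an image. Here is a minimal sketch on a mixed document, assuming Jsoup is on the classpath. The class name, the method names and the second link (about.htm) are made up for illustration. It also shows why the earlier attempts fail: :not(img) tests the a element itself, which is never an img, whereas :has(img) looks at its descendants.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
public class LinkFilterExample {

    // Returns the href of every <a href> that has no <img> descendant,
    // using the selector from the answer.
    public static List<String> hrefsWithoutImages(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]:not(:has(img))")
                .stream()
                .map(link -> link.attr("href"))
                .collect(Collectors.toList());
    }

    // Stream-based alternative: select all links first, then drop those
    // for which a nested select("img") finds anything.
    public static List<String> hrefsWithoutImagesViaFilter(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]")
                .stream()
                .filter(link -> link.select("img").isEmpty())
                .map(link -> link.attr("href"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One link wrapping an image, one plain text link (made-up sample).
        String html = "<a href=\"index.htm\"><img src=\"/java/images/logo.png\"></a>"
                + "<a href=\"about.htm\">About</a>";
        System.out.println(hrefsWithoutImages(html));          // prints [about.htm]
        System.out.println(hrefsWithoutImagesViaFilter(html)); // prints [about.htm]
    }
}
```

The selector version is shorter. The stream version may be handy if you already hold an Elements collection and want to filter it in Java, as in the question.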
I hope it helps :)