
Jsoup: select elements whose children do not contain a specific tag


I'm using Jsoup to extract links from a web page, but I want to skip links that wrap an image. The following code:

Document doc = Jsoup.connect(i_Url).userAgent("chrome/5.0").get();
Elements links = doc.select("a[href]");

gets me all the links, but some of them are images. Doing the following:

links.stream().filter(link -> !link.tagName().equals("img"));

won't work, because the img tag belongs to the link's child element, not to the link itself. For instance:

<a href="index.htm" title="tutorialspoint">
  <img alt="tutorialspoint" src="/java/images/logo.png">
</a>

I tried all sorts of things, such as:

Elements links = doc.select("a[href]").select(":not(img)"); //or
Elements links = doc.select("a[href]:not(img)"); //or
Elements links = doc.select("a[href]");
links.stream().filter(link -> link.children().contains(Tag.valueOf("img")));

I tried all kinds of variations, and none of them worked. I'm not an expert when it comes to HTML. Help would be appreciated. Thanks.


Solution

  • Use the following selector:

    a[href]:not(:has(img))
    

    I have just tested it with the following unit test, and it works like a charm:

    @Test
    public void testParsingLinksWithoutImagesInside() {
        //given:
        String html = "<a href=\"index.htm\" title=\"tutorialspoint\">\n" +
                "  <img alt=\"tutorialspoint\" src=\"/java/images/logo.png\">\n" +
                "</a>";
    
        //when:
        Document document = Jsoup.parse(html);
        Elements elements = document.select("a[href]:not(:has(img))");
    
        //then:
        assertThat(elements.size()).isEqualTo(0);
    }
    

    I hope it helps :)
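
The test above only covers the negative case (a link that wraps an image is excluded). As a complementary sketch, the snippet below runs the same selector against a document that mixes both kinds of links, and also shows a stream-based filter that should give the same result. The class name `NonImageLinks`, the method names, and the sample HTML are made up for illustration. The selectors from the question fail because `:not(img)` is tested against the `a` element itself, which is never an `img`, whereas `:has(img)` looks at the elements inside it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
public class NonImageLinks {

    // Selector-based: keep <a href> elements that contain no <img> descendant.
    static List<String> hrefsWithSelector(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("a[href]:not(:has(img))");
        return links.eachAttr("href");
    }

    // Stream-based alternative: select all links, then drop those
    // for which a nested select("img") finds anything.
    static List<String> hrefsWithStream(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]").stream()
                .filter(link -> link.select("img").isEmpty())
                .map(link -> link.attr("href"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input (assumed): one image link, one plain text link.
        String html = "<a href=\"index.htm\"><img src=\"/java/images/logo.png\"></a>"
                + "<a href=\"about.htm\">About</a>";

        System.out.println(hrefsWithSelector(html)); // prints [about.htm]
        System.out.println(hrefsWithStream(html));   // prints [about.htm]
    }
}
```

Note that `:has(img)` matches an `img` at any depth inside the link, not only a direct child, so a link such as `<a href="x"><span><img ...></span></a>` is excluded as well.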