
JSoup - Select Images not inside a link


How can I select all images that are not inside of a link element?

document.select("a img"); //selects all images inside a link
document.select(":not(a) img"); //images not inside a link (does not work)

Solution

  • The problem is that :not(a) img only requires the <img> to have at least one ancestor that is not an <a>. <body>, for example, matches :not(a), so your selector matches nearly all <img> tags. This holds even if the HTML string you pass to Jsoup.parse() has no <body> or <html> tag, because Jsoup generates them automatically.
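    To make this concrete, here is a minimal sketch, assuming Jsoup is on the classpath. The class name DescendantNotDemo and the helper matchedIds are made up for illustration. It parses a fragment with no <html> or <body> and runs the selector from the question:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Jsoup.
class DescendantNotDemo {
    // Returns the id of every element matched by cssQuery in the parsed html.
    static List<String> matchedIds(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        List<String> ids = new ArrayList<>();
        for (Element element : document.select(cssQuery)) {
            ids.add(element.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        // No <html> or <body> here; Jsoup generates both while parsing.
        String fragment = "<a><img id=\"a-img\"></a>";
        // The generated <body> is a non-<a> ancestor, so the image is still selected.
        System.out.println(matchedIds(fragment, ":not(a) img"));
    }
}
```

    The image sits directly inside an <a>, yet it should still be selected, because the generated <body> above it satisfies :not(a).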

    Let's assume we have the following HTML:

    <html>
      <body>
        <a><div><img id="a-div-img"></div></a>
        <a><img id="a-img"></a>
        <img id="img">
      </body>
    </html>
    

    If you only want to exclude <img> elements that are direct children of an <a>, you can use :not(a) > img as the selector:

    Elements images = document.select(":not(a) > img");
    

    The result will be this:

    <img id="a-div-img">
    <img id="img">
    

    The problem with this is that it also selects the first <img> of the example (#a-div-img), which is actually inside an <a>, just not as a direct child. If this is enough for your needs, you can go with this solution.
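    The child-combinator behaviour can be checked with a short sketch like the following. It assumes Jsoup is on the classpath; DirectChildDemo and imagesNotDirectlyInLink are illustrative names, and the HTML is the example from above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Jsoup.
class DirectChildDemo {
    static final String HTML =
            "<html><body>"
            + "<a><div><img id=\"a-div-img\"></div></a>"
            + "<a><img id=\"a-img\"></a>"
            + "<img id=\"img\">"
            + "</body></html>";

    // Collects the ids of all <img> elements whose direct parent is not an <a>.
    static List<String> imagesNotDirectlyInLink(String html) {
        List<String> ids = new ArrayList<>();
        for (Element image : Jsoup.parse(html).select(":not(a) > img")) {
            ids.add(image.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        // #a-img is skipped; #a-div-img is kept because its parent is the <div>.
        System.out.println(imagesNotDirectlyInLink(HTML));
    }
}
```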

    Excluding every <img> nested anywhere inside an <a> does not seem to be possible with a pure CSS selector (at least I haven't found a solution yet). But you can simply remove all <a> tags from the document before selecting the <img> tags:

    document.select("a").remove();
    Elements images = document.select("img");
    

    The result will be just this:

    <img id="img">
    

    If you need the original document without modifications, call Document.clone() first and work on the copy:

    Document tempDocument = document.clone();
    tempDocument.select("a").remove();
    Elements images = tempDocument.select("img");
    

    This way, the original document is never modified.
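    Put together as a runnable sketch (assuming Jsoup is on the classpath; CloneBeforeRemoveDemo and imagesOutsideLinks are hypothetical names), the clone-then-remove approach might look like this:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class; not part of Jsoup.
class CloneBeforeRemoveDemo {
    // Returns the images that are not nested in any <a>, leaving `document` untouched.
    // Note: the returned elements belong to the clone, not to the original document.
    static Elements imagesOutsideLinks(Document document) {
        Document copy = document.clone();
        copy.select("a").remove();   // drops every <a> together with its content
        return copy.select("img");
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(
                "<a><div><img id=\"a-div-img\"></div></a>"
                + "<a><img id=\"a-img\"></a>"
                + "<img id=\"img\">");

        Elements images = imagesOutsideLinks(document);
        System.out.println(images);

        // The original still contains both links and all three images.
        System.out.println(document.select("a").size());
        System.out.println(document.select("img").size());
    }
}
```

    Because the selected elements live in the copy, changes made to them will not show up in the original document. If you need references into the original, this sketch is not enough on its own.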