java jsoup

JSoup - How to extract only the href in paragraph


I have the following:

<div>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:149.92679355783%;">
        <img loading="lazy" src="https://urlIwant.com" width="683" height="1024" class="alignnone size-full wp-image-200816" />
      </span>
    </a>
  </p>
  <p>
    <span id="more-20000"></span>
  </p>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:149.92679355783%;">
        <img loading="lazy" src="https://urlIwant.com" width="683" height="1024" class="alignnone size-full wp-image-200833" />
      </span>
    </a>
  </p>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:145.71428571429%;">
        <img loading="lazy" src="https://urlIwant.com" width="700" height="1020" class="alignnone size-medium wp-image-200834" sizes="(max-width: 700px) 100vw, 700px" />
      </span>
    </a>
  </p>
  <p>
    <a href="https://urlIwant.com" data-wpel-link="internal">
      <span class="image-holder" style="padding-bottom:143.42857142857%;">
        <img loading="lazy" src="https://urlIwant.com" width="700" height="1004" class="alignnone size-medium wp-image-200835" sizes="(max-width: 700px) 100vw, 700px" />
      </span>
    </a>
  </p>
</div>

How can I extract the href of every link that sits inside a paragraph tag and contains a span with the class "image-holder"?

My selector below matches the links inside paragraphs, but I can't figure out how to add the span class to it:

try {
    Document doc = Jsoup.connect("https://urltoextractfrom.com").get();
    Elements selections = doc.select("p a[href]");
    for (Element e : selections) {
        System.out.println(e);
    }
} catch (Exception e) {
    e.printStackTrace();
}

Solution

  • If I have understood what you want to extract correctly, you can use this selector:

    p a:has(span.image-holder)
    

    That matches every a element that descends from a p element and contains a span with the class image-holder.

    So in code:

    Document document = Jsoup.parse(html);
    Elements links = document.select("p a:has(span.image-holder)");
    List<String> urls = links.eachAttr("href");
    

    You can use the try.jsoup REPL to quickly iterate on selectors. https://try.jsoup.org/~wvd2VHaJtnr10qEiLS9g_-E6UA8

    (If this selects content that you don't want, you can clarify that in your question with examples.)
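
    For completeness, here is a minimal self-contained sketch of the same idea, run against an inline HTML string rather than a live page. The class name ImageHolderLinks, the helper extractUrls, and the example.com URLs are made up for illustration, and the selector adds [href] on the assumption that anchors without an href should be skipped:

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ImageHolderLinks {

    // Hypothetical helper: collects the href of every <a href> that is
    // inside a <p> and contains a <span class="image-holder">.
    static List<String> extractUrls(String html) {
        Document document = Jsoup.parse(html);
        Elements links = document.select("p a[href]:has(span.image-holder)");
        return links.eachAttr("href");
    }

    public static void main(String[] args) {
        // Sample markup (assumed): one matching link, one paragraph with
        // no link, and one link that has no image-holder span.
        String html = "<div>"
            + "<p><a href=\"https://example.com/one\">"
            + "<span class=\"image-holder\"><img src=\"one.jpg\"></span></a></p>"
            + "<p><span id=\"more-20000\"></span></p>"
            + "<p><a href=\"https://example.com/two\">plain text link</a></p>"
            + "</div>";

        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

    With the sample markup above, only the first link should be printed, since the second one has no image-holder span. To run it against a real page, replacing Jsoup.parse(html) with Jsoup.connect(url).get(), as in the question, should work the same way.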