Tags: java, html, web-scraping, jsoup

Get all links from a div with Jsoup


Basically, I am using Jsoup to parse a site, and I want to get all the links from the following HTML:

<ul class="detail-main-list">
  <li> 
    <a href="/manga/toki_wa/v01/c001/1.html" title="Toki wa... Vol.01 Ch.001 -Toki wa..." target="_blank"> Dis Be the link</a>
   </li> 
</ul>

Any idea how?


Solution

  • Straight from jsoup.org, right there, first thing you see:

    Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
    log(doc.title());
    Elements newsHeadlines = doc.select("#mp-itn b a");
    for (Element headline : newsHeadlines) {
      log("%s\n\t%s", 
        headline.attr("title"), headline.absUrl("href"));
    }
    

    Modifying this to what you need seems trivial:

    // Swap in the URL of the page you are actually scraping.
    Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
    Elements anchorTags = doc.select("ul.detail-main-list a");
    for (Element anchorTag : anchorTags) {
      System.out.println("Links to: " + anchorTag.attr("href"));
      System.out.println("In absolute form: " + anchorTag.absUrl("href"));
      System.out.println("Text content: " + anchorTag.text());
    }
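
To try the selector without a network call, a minimal sketch might parse the question's markup from a string instead. The class name DetailListLinks, the method extractLinks, and the base URL https://example.com/ are assumptions made up for illustration; the title attribute is left out for brevity, and jsoup has to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DetailListLinks {

    // Hypothetical base URL, used only to resolve relative hrefs.
    static final String BASE_URI = "https://example.com/";

    // The markup from the question, with the title attribute omitted.
    static final String HTML =
        "<ul class=\"detail-main-list\">"
      + "<li><a href=\"/manga/toki_wa/v01/c001/1.html\" target=\"_blank\"> Dis Be the link</a></li>"
      + "</ul>";

    /** Returns the absolute URL of every <a> inside ul.detail-main-list. */
    static List<String> extractLinks(String html, String baseUri) {
        // Parsing with a base URI lets absUrl() resolve relative links.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element anchorTag : doc.select("ul.detail-main-list a")) {
            links.add(anchorTag.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        for (String link : extractLinks(HTML, BASE_URI)) {
            System.out.println(link);
        }
    }
}
```

When the page is fetched with Jsoup.connect(...).get() instead, the base URI is taken from the request URL, so absUrl works without passing one in.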
    

    The ul.detail-main-list a part is a so-called selector string; jsoup uses CSS-style selectors. A really short tutorial on these:

    • foo means: Any HTML element with that tag name, i.e. <foo></foo>.
    • .bar means: Any HTML element with class bar, i.e. <foo class="bar baz"></foo>
    • #bar means: Any HTML element with id bar, i.e. <foo id="bar"></foo>
    • These can be combined: ul.detail-main-list matches any <ul> tags that have the string detail-main-list in their list of classes.
    • a b means: everything matching the 'b' selector that has something matching 'a' as an ancestor, at any depth (not just as the direct parent). So ul a matches all <a> tags that have a <ul> tag around them somewhere.
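
The selector forms in the list above can be checked against a small document. This is only a sketch: the markup, the class name SelectorDemo, and the helper count are invented for the example, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {

    // Made-up markup, just to exercise each selector form.
    static final String HTML =
        "<div id=\"main\">"
      + "<ul class=\"detail-main-list extra\"><li><a href=\"/one\">one</a></li></ul>"
      + "<ul class=\"other\"><li><a href=\"/two\">two</a></li></ul>"
      + "</div>"
      + "<a href=\"/three\">three</a>";

    /** Counts the elements in HTML that match the given selector. */
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));                     // 3: every <a> tag
        System.out.println(count(".detail-main-list"));     // 1: class match, extra classes are fine
        System.out.println(count("#main"));                 // 1: id match
        System.out.println(count("ul.detail-main-list"));   // 1: tag and class combined
        System.out.println(count("ul a"));                  // 2: <a> tags nested anywhere inside a <ul>
        System.out.println(count("ul.detail-main-list a")); // 1: the selector from the answer
    }
}
```

The third link sits outside any <ul>, which is why ul a finds two links while a finds three.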

    The JSoup docs are excellent.