I am using Jsoup to parse a site, and I want to get all the links from the following HTML:
<ul class="detail-main-list">
<li>
<a href="/manga/toki_wa/v01/c001/1.html" title="Toki wa... Vol.01 Ch.001 -Toki wa..." target="_blank"> Dis Be the link</a>
</li>
</ul>
Any idea how?
Straight from jsoup.org, right there, first thing you see:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    log("%s\n\t%s",
        headline.attr("title"), headline.absUrl("href"));
}
Modifying this to what you need seems trivial:
// Replace this URL with the page you actually want to parse.
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
    System.out.println("Links to: " + anchorTag.attr("href"));
    System.out.println("In absolute form: " + anchorTag.absUrl("href"));
    System.out.println("Text content: " + anchorTag.text());
}
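If you want to try the selector on the snippet from the question without touching the network, something along these lines should do. Treat it as a sketch: the class name LinkExtractor, the method extractLinks, and the base URL https://example.com/ are placeholders I made up, since the question doesn't say which site the HTML comes from. Jsoup.parse(html, baseUri) is the real jsoup call, and the base URI is what lets absUrl turn the relative href into a full URL.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Returns the absolute URL of every <a> inside <ul class="detail-main-list">.
    // baseUri is an assumption: pass the address of the page the HTML came from.
    static List<String> extractLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element anchorTag : doc.select("ul.detail-main-list a")) {
            links.add(anchorTag.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"detail-main-list\"><li>"
                + "<a href=\"/manga/toki_wa/v01/c001/1.html\"> Dis Be the link</a>"
                + "</li></ul>";
        // https://example.com/ is a placeholder base URL, not the real site.
        for (String link : extractLinks(html, "https://example.com/")) {
            System.out.println(link);
        }
    }
}
```

Without a base URI, absUrl has nothing to resolve a relative href against and gives you an empty string, so pass the page address if you need absolute links.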
The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:
foo means: any HTML element with that tag name, e.g. <foo></foo>
.bar means: any HTML element with class bar, e.g. <foo class="bar baz"></foo>
#bar means: any HTML element with id bar, e.g. <foo id="bar">
ul.detail-main-list matches any <ul> tag that has the string detail-main-list in its list of classes.
a b means: all things matching the 'b' selector that have something matching 'a' as an ancestor (not necessarily the direct parent). So ul a matches all <a> tags that have a <ul> tag around them someplace.

The jsoup docs are excellent.
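To see those selector forms in action, you can run them against a small in-memory document. This is just an illustrative sketch: SelectorDemo, countMatches, and the markup are invented for the demo, and the only jsoup calls used are Jsoup.parse and Element.select.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {

    // Counts how many elements in the given HTML match the CSS selector.
    static int countMatches(String html, String selector) {
        Document doc = Jsoup.parse(html);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        // Made-up markup, just enough to exercise each selector form.
        String html = "<ul class=\"detail-main-list extra\"><li><a href=\"/one\">one</a></li></ul>"
                + "<p id=\"intro\"><a href=\"/two\">two</a></p>";

        System.out.println(countMatches(html, "a"));                   // by tag name
        System.out.println(countMatches(html, ".extra"));              // by class
        System.out.println(countMatches(html, "#intro"));              // by id
        System.out.println(countMatches(html, "ul.detail-main-list")); // tag plus class
        System.out.println(countMatches(html, "ul a"));                // descendant
    }
}
```

Both links match the plain a selector, but only the one nested in the list matches ul a, which is why the descendant form is the one you want for the question's HTML.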