I am currently making an article content extraction application using Jsoup and Java. My problem is that when I scrape an article, Jsoup returns a flat list of Elements rather than preserving the order of the article. For example, a normal article with more than one image could have an order like this: (title, sapo (summary), image, paragraph, image, paragraph, paragraph, image, paragraph). How can I scrape the main content of the website (text and image links) without losing that order? Below is my idea for doing that, but it doesn't work.
int cur = 0;
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div");
for (Element element : elements) {
    if (element.select("div[type=\"Photo\"] img").hasAttr("src")) {
        Elements temp = element.select("div[type=\"Photo\"] img");
        System.out.println(temp.get(cur).attr("src"));
        cur++;
    }
    System.out.println(element.select("p span").text());
    System.out.println("");
}
If you wanted to extract the article data from the sites that you linked to in the comment, you could do something like this:
Document doc = Jsoup.connect(url).get();

// Full article
Elements elements = doc.select("div.sidebar-1");

System.out.println("## Article title:");
System.out.println(elements.select("h1.title-detail").text());

System.out.println("## Article summary:");
System.out.println(elements.select("p.description").text());

// Images and paragraphs, in document order
for (Element e : elements.select("article.fck_detail p, figure")) {
    if (e.is("p")) {
        System.out.println("## Paragraph");
        System.out.println(e.text());
    } else {
        System.out.println("## Image (image URL)");
        System.out.println(e.select("img[src]").attr("src"));
    }
}
The idea is this: select both the image (figure) and paragraph (p) elements of the article with a single select call - the order will be preserved automatically, because Jsoup returns matched elements in document order.
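
To illustrate that behavior independently of any particular site, here is a minimal, self-contained sketch. The markup and class name (OrderDemo) are hypothetical and only serve to show that one combined selector yields matches in the order they appear in the document:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OrderDemo {
    public static void main(String[] args) {
        // Hypothetical, hard-coded markup standing in for a real article.
        String html = "<article>"
                + "<p>first paragraph</p>"
                + "<figure><img src=\"a.jpg\"></figure>"
                + "<p>second paragraph</p>"
                + "<figure><img src=\"b.jpg\"></figure>"
                + "</article>";

        Document doc = Jsoup.parse(html);

        // One select call matching both element types; the results come
        // back in document order, so text and images stay interleaved
        // exactly as they appear in the article.
        for (Element e : doc.select("article p, article figure")) {
            if (e.is("p")) {
                System.out.println("Paragraph: " + e.text());
            } else {
                System.out.println("Image: " + e.select("img[src]").attr("src"));
            }
        }
        // Expected output order: paragraph, image, paragraph, image
    }
}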