java, web-scraping, jsoup

Java Jsoup: article extraction with image links and paragraphs


I am building an article content extraction application using Jsoup and Java. My problem is that when I scrape an article, Jsoup returns a list of Element objects without preserving the order of the article's content. For example, a typical article with more than one image could have an order like this: (title, sapo (summary), image, paragraph, image, paragraph, paragraph, image, paragraph). So how can I scrape the main content of the page (text and image links) without losing its order? Below is my attempt, but it doesn't work.

int cur = 0;
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div");
for (Element element : elements) {
    if (element.select("div[type=\"Photo\"] img").hasAttr("src")) {
        Elements temp = element.select("div[type=\"Photo\"] img");
        System.out.println(temp.get(cur).attr("src"));
        cur++;
    }
    System.out.println(element.select("p span").text());
    System.out.println("");
}

Solution

  • If you want to extract the article data from the sites that you linked to in the comment, you could do something like this:

    Document doc = Jsoup.connect(url).get();
    
    // Full article
    Elements elements = doc.select("div.sidebar-1");
    
    System.out.println("## Article title:");
    System.out.println(elements.select("h1.title-detail").text());
    
    System.out.println("## Article summary:");
    System.out.println(elements.select("p.description").text());
    
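    // Images and paragraphs, visited in the order they appear in the page.
    // Note: "figure" after the comma is not scoped to article.fck_detail;
    // it matches any figure inside div.sidebar-1.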
    for (Element e : elements.select("article.fck_detail p,figure")) {
        if (e.is("p")) {
            System.out.println("## Paragraph");
            System.out.println(e.text());
        } else {
            System.out.println("## Image (image URL)");
            System.out.println(e.select("img[src]").attr("src"));
        }
    }
    

    The idea is:

    1. Find the outermost container that holds the full article.
    2. Extract the title and the summary.
    3. Loop through the image (figure) and paragraph (p) elements of the article using a single combined selector. Jsoup returns the matches in document order, so the order of the article is preserved automatically.
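
To see the ordering behaviour without hitting a live site, here is a minimal sketch that runs the same kind of combined selector against an inline HTML string. The class name `ArticleOrderDemo`, the helper `extractInOrder`, the `P:`/`IMG:` prefixes and the sample markup are all made up for illustration; only the `article.fck_detail` class comes from the answer above. It assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ArticleOrderDemo {

    // Hypothetical helper: collects paragraphs and image URLs
    // in the order they occur in the parsed document.
    static List<String> extractInOrder(String html) {
        Document doc = Jsoup.parse(html);
        List<String> parts = new ArrayList<>();

        // One combined selector; both alternatives are scoped to the article.
        for (Element e : doc.select("article.fck_detail p, article.fck_detail figure")) {
            if (e.is("p")) {
                parts.add("P: " + e.text());
            } else {
                // attr() reads the attribute from the first matching img
                parts.add("IMG: " + e.select("img[src]").attr("src"));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from any real site
        String html = "<article class=\"fck_detail\">"
                + "<figure><img src=\"a.jpg\"></figure>"
                + "<p>First paragraph</p>"
                + "<figure><img src=\"b.jpg\"></figure>"
                + "<p>Second paragraph</p>"
                + "</article>";

        extractInOrder(html).forEach(System.out::println);
    }
}
```

Running two separate `select` calls (one for `p`, one for `figure`) would instead give you two independent lists, which is where the original ordering gets lost. If a figure contains its caption in a `p` element, that caption would also be matched as a paragraph, so you may need to narrow the selector for a particular site.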