I'm currently working on a Java desktop app for a company, and they asked me to extract the last 5 articles from a web page and display them in the app. To do this I need an HTML parser, of course, and I immediately thought of JSoup. But my problem is: how exactly do I do it? I found one easy example in this question: How to “scan” a website (or page) for info, and bring it into my program?
with this code:
package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }
}
This code was written by BalusC and I understand it, but how do I do it when the links are not fixed, which is the case for most newspapers? For the sake of simplicity, how would I go about extracting, for example, the last 5 articles from this news page: News? I can't use an RSS feed, as my boss wants the complete articles to be displayed.
First you need to download the main page:
Document doc = Jsoup.connect("https://globalnews.ca/world/").get();
Then you select the links you are interested in, for example with a CSS selector: select all a tags whose href contains the text globalnews and that are nested in an h3 tag with the class story-h. The URLs are in the href attribute of the a tag.
for (Element e : doc.select("h3.story-h > a[href*=globalnews]")) {
    System.out.println(e.attr("href"));
}
You can then process the resulting URLs as you wish, e.g. download the content of the first five of them with the same Jsoup.connect(url).get() call shown above.
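Putting the steps together, here is a minimal sketch. Note the assumptions: the class name LatestArticles and the helper method latestArticleUrls are mine, the stand-in HTML in main is fabricated so the selection logic can be shown offline, and whatever selector you need to pull the article body out of an individual article page has to be found by inspecting that page yourself. In the real app you would replace the Jsoup.parse(html) call with Jsoup.connect("https://globalnews.ca/world/").get().

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LatestArticles {

    // Collect the URLs of the first n teaser links matching the selector
    // from the answer above.
    static List<String> latestArticleUrls(Document doc, int n) {
        return doc.select("h3.story-h > a[href*=globalnews]").stream()
                .limit(n)
                // "abs:href" resolves relative URLs against the page's base URL
                .map(a -> a.attr("abs:href"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real listing page (fabricated for the demo);
        // in the app: Document doc = Jsoup.connect("https://globalnews.ca/world/").get();
        String html = "<h3 class='story-h'><a href='https://globalnews.ca/news/1'>One</a></h3>"
                + "<h3 class='story-h'><a href='https://globalnews.ca/news/2'>Two</a></h3>"
                + "<h3 class='story-h'><a href='https://example.com/3'>Other</a></h3>";
        Document doc = Jsoup.parse(html);

        for (String url : latestArticleUrls(doc, 5)) {
            System.out.println(url);
            // To display the complete article, fetch each URL in turn:
            //   Document article = Jsoup.connect(url).get();
            // and extract its body with a selector you determine by
            // inspecting the article page's HTML.
        }
    }
}
```

The link whose href does not contain "globalnews" is filtered out by the selector, and limit(n) caps the result at the five most recent articles as they appear on the page.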