I'm trying to get the details of the latest release from the information box on the right side of the page. Specifically, I want to retrieve "6.2 (Build 9200) / August 1, 2012; 7 years ago" from the box by scraping this page using jsoup.
I have code that pulls all the data from the box, but I can't figure out how to pull just that specific part.
org.jsoup.Connection.Response res = Jsoup.connect("https://en.wikipedia.org/wiki/Windows_Server_2012").execute();
String html = res.body();
Document doc2 = Jsoup.parseBodyFragment(html);
Element body = doc2.body();
Elements tables = body.getElementsByTag("table");
for (Element table : tables) {
    // Print the first table whose class attribute mentions "infobox"
    if (table.className().contains("infobox")) {
        System.out.println(table.outerHtml());
        break;
    }
}
You can query for the table row that contains a link whose href ends with Software_release_life_cycle:
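import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "https://en.wikipedia.org/wiki/Windows_Server_2012";
try {
    Document document = Jsoup.connect(url).get();
    // Select every table row containing an element whose href ends with the given suffix
    Elements elements = document.select("tr:has([href$=Software_release_life_cycle])");
    for (Element element : elements) {
        System.out.println(element.text());
    }
} catch (IOException e) {
    // exception handling
}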
This works because, looking at the full HTML, I found that the row you need (and only the row you need, which is the vital detail!) contains such a link. In fact, elements will contain only a single Element.
Finally, you extract only the text. This code will print:
Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]
If you need further refinement, you can always take a substring of it.
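As a rough sketch of that refinement, assuming the row text always has the shape printed above (a "Latest release" label, the value, an ISO date in parentheses and a citation marker), something like the following could trim it down. The class name ReleaseText and the method refine are hypothetical names made up for this illustration, and it uses only plain Java string methods, not jsoup:

```java
class ReleaseText {
    // Hypothetical helper: reduces the row text to just the release value.
    // Assumes the label is exactly "Latest release", as in the output above.
    static String refine(String rowText) {
        String label = "Latest release";
        String value = rowText.startsWith(label)
                ? rowText.substring(label.length())
                : rowText;
        return value
                .replaceAll("\\[\\d+\\]", "")                 // drop citation markers such as [2]
                .replaceAll("\\(\\d{4}-\\d{2}-\\d{2}\\)", "") // drop the ISO date in parentheses
                .trim();
    }

    public static void main(String[] args) {
        String row = "Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]";
        System.out.println(refine(row));
    }
}
```

With the row text from above, this leaves only "6.2 (Build 9200) / August 1, 2012; 7 years ago". If the page layout or the label wording changes, the prefix check and the patterns would need adjusting.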
Hope I helped!