java html web-scraping jsoup

Get specific information from Wikipedia Information Box


I'm trying to get the details of the latest release in the information box on the right side. I'm trying to retrieve "6.2 (Build 9200) / August 1, 2012; 7 years ago" from the box by scraping this page using jsoup.

I have code that pulls all data from the box but I can't figure out how to pull the specific part of the box.

// Fetch the page and parse the response body as a full HTML document.
org.jsoup.Connection.Response res = Jsoup.connect("https://en.wikipedia.org/wiki/Windows_Server_2012").execute();
String html = res.body();
Document doc2 = Jsoup.parse(html);
Element body = doc2.body();
Elements tables = body.getElementsByTag("table");
for (Element table : tables) {
    // Print the first table whose class attribute mentions "infobox".
    if (table.className().contains("infobox")) {
        System.out.println(table.outerHtml());
        break;
    }
}

Solution

  • You can select the table row that contains a link whose href ends with Software_release_life_cycle:

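    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    String url = "https://en.wikipedia.org/wiki/Windows_Server_2012";
    try {
        // Fetch and parse the page in one step.
        Document document = Jsoup.connect(url).get();
        // Keep only rows containing an element whose href ends with the given suffix.
        Elements elements = document.select("tr:has([href$=Software_release_life_cycle])");
        for (Element element : elements) {
            // text() strips the markup and returns the row's visible text.
            System.out.println(element.text());
        }
    } catch (IOException e) {
        // exception handling
    }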
    

    This works because, looking at the full HTML, I found that the row you need (and only that row, which is the vital detail) contains such a link. In fact, elements will contain exactly one Element.
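
To see the selector at work without hitting the network, here is a minimal sketch that runs it against an inline table. The markup in SAMPLE_HTML is a made-up, simplified stand-in for the infobox, and the class and method names (InfoboxRowDemo, rowsWithLinkEndingIn) are hypothetical; the real page is far larger, so treat this only as an illustration of how tr:has([href$=...]) narrows the match.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class InfoboxRowDemo {
    // Simplified, made-up stand-in for the infobox markup (not the real Wikipedia HTML).
    static final String SAMPLE_HTML =
        "<table class=\"infobox\">"
        + "<tr><th>Developer</th><td><a href=\"/wiki/Microsoft\">Microsoft</a></td></tr>"
        + "<tr><th><a href=\"/wiki/Software_release_life_cycle\">Latest release</a></th>"
        + "<td>6.2 (Build 9200) / August 1, 2012</td></tr>"
        + "</table>";

    // Hypothetical helper: returns the rows that contain an element
    // whose href attribute ends with the given suffix.
    static Elements rowsWithLinkEndingIn(String html, String hrefSuffix) {
        Document document = Jsoup.parse(html);
        return document.select("tr:has([href$=" + hrefSuffix + "])");
    }

    public static void main(String[] args) {
        Elements rows = rowsWithLinkEndingIn(SAMPLE_HTML, "Software_release_life_cycle");
        // Only the second row of the sample table carries a matching link.
        System.out.println("Matching rows: " + rows.size());
        System.out.println(rows.text());
    }
}
```

The "Developer" row also contains a link, but its href does not end with the suffix, so it is skipped.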

    Finally you extract only the text. This code will print:

    Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]
    

    If you need even more refinement you can always substring it.
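
As a rough sketch of that substring step, the following uses only the standard library. It assumes the row text always starts with the label "Latest release " and ends with a parenthesised ISO date plus a footnote marker, as in the output above; the class and method names (LatestReleaseTrim, trimLatestRelease) are hypothetical, and a page with a different layout would need different cut points.

```java
public class LatestReleaseTrim {
    // Hypothetical helper: drops the leading row label and everything from
    // the last " (" onward, i.e. the trailing "(2012-08-01)[2]" part.
    static String trimLatestRelease(String rowText) {
        String label = "Latest release ";
        int start = rowText.startsWith(label) ? label.length() : 0;
        int end = rowText.lastIndexOf(" (");
        if (end < start) {
            // No trailing parenthesised part found: keep the rest of the string.
            end = rowText.length();
        }
        return rowText.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String rowText = "Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]";
        // prints: 6.2 (Build 9200) / August 1, 2012; 7 years ago
        System.out.println(trimLatestRelease(rowText));
    }
}
```

Using lastIndexOf rather than indexOf matters here, because the earlier "(Build 9200)" also begins with " (".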

    Hope I helped!

    (selector syntax reference)