Tags: java, web-scraping, html-content-extraction

Using Java to extract a single value from an HTML page


I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. Inside the iframe, the data sits in markup something like this:

<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>

There is a BUNCH of other crap above it, but this div id / label is totally unique and is not used anywhere else in the page.


Solution

  • jsoup is probably what you want; it excels at extracting data from an HTML document.

    There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax

    The process takes two steps:

    • parse the outer page and find the URL of the iframe
    • load and parse the content of the iframe, then extract the information you need

    The code would look like this:

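     import java.net.URL;
     import org.jsoup.Jsoup;
     import org.jsoup.nodes.Document;
     import org.jsoup.nodes.Element;

     // let's find the iframe; "inputstream" is an InputStream already opened
     // on the outer page and "url" is that page's address as a String
     Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
     Element iframe = document.select("iframe").first();
    
     // now load the iframe (absUrl resolves a relative src; 15 second timeout)
     URL iframeUrl = new URL(iframe.absUrl("src"));
     document = Jsoup.parse(iframeUrl, 15000);
    
     // extract the div, then the text of the label inside it
     Element div = document.getElementById("number_forecast");
     String value = div.getElementById("lblDay").text(); // "9,000" for the markup in the question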
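
Once the label text has been extracted, it is still a string such as "9,000". The question does not say what the value is used for, so this is only a sketch under an assumption: if it is needed as a number, the JDK's java.text.NumberFormat can parse it. The class name ForecastValue and the method parseForecast are hypothetical, and Locale.US is assumed because the sample value uses a comma as the thousands separator.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class ForecastValue {

    // Hypothetical helper: converts label text such as "9,000" to a long.
    // Assumes US-style formatting (comma as the grouping separator).
    static long parseForecast(String labelText) {
        NumberFormat format = NumberFormat.getIntegerInstance(Locale.US);
        try {
            return format.parse(labelText.trim()).longValue();
        } catch (ParseException e) {
            // Text that does not start with a number ends up here
            throw new IllegalArgumentException("Not a number: " + labelText, e);
        }
    }

    public static void main(String[] args) {
        // "9,000" is the sample value from the markup in the question
        System.out.println(parseForecast("9,000")); // prints 9000
    }
}
```

If the site formats its numbers for another locale (for example a period as the grouping separator), the Locale passed to getIntegerInstance would need to change accordingly.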