java web-scraping html-content-extraction

Using Java to extract a single value from an HTML page:

I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:

<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>

There is a bunch of other markup above it, but this div id / label combination is unique and is not used anywhere else in the page.

Solution

jsoup is probably what you want; it excels at extracting data from an HTML document.

There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
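As a rough sketch of that selector syntax applied to the markup from the question, the snippet below parses an HTML string and picks out the label by its ids. The class name SelectorSketch and the helper forecast are made up for illustration, and selectFirst assumes a reasonably recent jsoup release (1.11.1 or later) on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {
    // Hypothetical helper: returns the text of the label with id "lblDay"
    // that sits directly inside the div with id "number_forecast",
    // or null when the markup does not contain it.
    static String forecast(String html) {
        Document doc = Jsoup.parse(html);
        Element label = doc.selectFirst("#number_forecast > #lblDay");
        return label == null ? null : label.text();
    }

    public static void main(String[] args) {
        String html = "<DIV id=\"number_forecast\"><LABEL id=\"lblDay\">9,000</LABEL></DIV>";
        System.out.println(forecast(html));
    }
}
```

The upper-case tag names in the page are not a problem here, since the selector matches on the id attributes rather than on tag names.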

The process has two steps:

1. Parse the page and find the URL of the iframe.
2. Parse the content of the iframe and extract the information you need.

The code would look like this:

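 // needs: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, java.net.URL

 // let's find the iframe
 // (inputstream is the page content, url is the page address as a String)
 Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
 Element iframe = document.select("iframe").first(); // null if the page has no iframe

 // now load the iframe; absUrl resolves a relative src against the page url
 URL iframeUrl = new URL(iframe.absUrl("src"));
 document = Jsoup.parse(iframeUrl, 15000); // timeout in milliseconds

 // extract the div, then the text of the label inside it
 Element div = document.getElementById("number_forecast");
 String value = div.getElementById("lblDay").text();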
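
The label in the question holds formatted text ("9,000"), so if you need a number rather than a string, one more step is required. Here is a minimal sketch using only the JDK. It assumes the page uses US-style grouping (comma as the thousands separator), and the class name ForecastValue and the helper parseForecast are hypothetical names chosen for illustration.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class ForecastValue {
    // Hypothetical helper: converts label text such as "9,000" to an int.
    // Assumption: comma is a thousands separator (Locale.US).
    static int parseForecast(String labelText) {
        NumberFormat format = NumberFormat.getIntegerInstance(Locale.US);
        try {
            return format.parse(labelText.trim()).intValue();
        } catch (ParseException e) {
            // Rethrow unchecked so callers are not forced to handle it
            throw new IllegalArgumentException("Not a number: " + labelText, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseForecast("9,000")); // prints 9000
    }
}
```

If the site formats numbers for a different locale, swap Locale.US for the matching one; otherwise the separator will be misread.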