I am trying to extract data from a couple of websites using Java. I am connecting to a website that has a table, and I need to extract the values from its td elements.
The problem is this:
- When I inspect the element in the browser, I can see the element and its value.
- When I view the page source in the browser, I get only the JavaScript.
I am using URL from the Java JDK 1.8, and when the code below runs I get the unrendered JavaScript instead of the elements the site shows when you visit it.
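import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

URL url = new URL("https://www.example.com");
URLConnection conn = url.openConnection();
// Send a browser-like User-Agent header with the request
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
StringBuilder builder = new StringBuilder();
// Read the raw response body line by line; the reader is closed automatically
try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
    String f;
    while ((f = in.readLine()) != null) {
        builder.append(f);
    }
}
String alltext = builder.toString();
// Quotes inside the string literal must be escaped
if (alltext.contains("<td colspan=\"1\">Something</td>")) {
    // ...Do something
}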
The reason is that the elements you saw were created by JavaScript, so they are not in the HTML the server returns
and you cannot get them directly from that response.
To get the element data, you need to parse the page only after the JavaScript has finished creating the elements.
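To confirm this diagnosis on your own page, here is a minimal sketch using only the JDK. It checks whether a given cell exists in an HTML string and counts the script tags in it. The class name StaticHtmlCheck, both helper methods, and the two sample HTML strings are hypothetical and only for illustration; in practice you would pass in the alltext string from your own code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original code.
class StaticHtmlCheck {

    // Returns true if the HTML contains a <td> (with any attributes)
    // whose text content is exactly cellText, ignoring surrounding whitespace.
    static boolean hasCell(String html, String cellText) {
        Pattern p = Pattern.compile(
                "<td\\b[^>]*>\\s*" + Pattern.quote(cellText) + "\\s*</td>",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(html).find();
    }

    // Counts opening <script> tags in the HTML.
    static int countScriptTags(String html) {
        Matcher m = Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE).matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample of what a server might send for a JavaScript-built page.
        String served = "<html><body><div id=\"app\"></div>"
                + "<script src=\"app.js\"></script></body></html>";
        // Made-up sample of the DOM after the script has run in a browser.
        String rendered = "<html><body><table><tr>"
                + "<td colspan=\"1\">Something</td></tr></table></body></html>";

        System.out.println(hasCell(served, "Something"));   // false
        System.out.println(hasCell(rendered, "Something")); // true
        System.out.println(countScriptTags(served));        // 1
    }
}
```

If hasCell returns false for the text you downloaded while the script count is greater than zero, the table is most likely being built in the browser, which matches the behaviour described in the question.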
Two solutions for you:
Note: this will require you to write more code and will take more time; there is no easy option in this case.