When I press CTRL+U on the website, the page source it shows is also larger than what I get from jsoup. The site I am using is Open SAP -> https://open.sap.com/courses. I have tried setting timeout and maxBodySize with Jsoup.connect. Right now my code looks like this:
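// Downloads the HTML exactly as the server sends it; no JavaScript is executed.
private static String getHtml(String location) throws IOException {
    URL url = new URL(location);
    URLConnection conn = url.openConnection();
    StringBuilder builder = new StringBuilder();
    // try-with-resources closes the reader; UTF-8 is assumed as the page encoding
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String input;
        while ((input = in.readLine()) != null) {
            // readLine() strips the line terminator, so add it back
            builder.append(input).append('\n');
        }
    }
    return builder.toString();
}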
document = Jsoup.parse(getHtml(URL));
But the same HTML is still returned. It is possible with Selenium, but that is a bit slow, so is there any other way to achieve this? My aim is to find the links to the courses and then load each of them to get its course summary, which would be too slow with Selenium.
Please suggest what can be done here.
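The content of this page is constructed inside your browser by JavaScript. jsoup and URLConnection only download the HTML that the server sends and never run the scripts, so you need a framework with JavaScript support to do this.
Using HtmlUnit, I got the page like this: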
String url = "https://open.sap.com/courses";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_68)) {
    // do not abort the page load when one of the page's scripts fails
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = webClient.getPage(url);
    // block until background JavaScript scheduled to start within the next 10 s has finished
    webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
    System.out.println("-------------------------------");
    System.out.println(page.asText());
    System.out.println("-------------------------------");
}
HtmlUnit has a rich API for working with the page object: you can search for controls or content, click controls, or extract the text from parts of the page.
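Since the goal in the question is to collect the course links, here is a rough sketch of how that step might look once the rendered markup is available as a string (with HtmlUnit, for example, from page.asXml()). It uses only the JDK so it can be tried without extra dependencies. The class name CourseLinkExtractor, the sample course ids java1 and ml1, and the assumption that course pages live under /courses/ are all illustrative and not taken from the actual site. Matching HTML with a regular expression is fragile, so treat this as a starting point rather than a finished solution.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CourseLinkExtractor {

    // Captures the quoted href value of an <a ...> tag (group 2).
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Assumption: course detail pages share this path prefix.
    private static final String COURSE_PREFIX = "/courses/";

    /** Returns absolute, de-duplicated course URLs in document order. */
    public static List<String> extractCourseLinks(String renderedHtml, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(renderedHtml);
        while (m.find()) {
            URI resolved;
            try {
                // turn relative hrefs into absolute URLs
                resolved = base.resolve(m.group(2).trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String path = resolved.getPath(); // null for mailto: and similar
            if (path != null
                    && path.startsWith(COURSE_PREFIX)
                    && path.length() > COURSE_PREFIX.length()) {
                links.add(resolved.toString());
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the rendered page.
        String sample = "<a href=\"/courses/java1\">Java</a>"
                + "<a class='card' href='https://open.sap.com/courses/ml1'>ML</a>"
                + "<a href=\"/about\">About</a>";
        extractCourseLinks(sample, "https://open.sap.com/courses")
                .forEach(System.out::println);
    }
}
```

Each returned URL can then be loaded with the same WebClient to read the course summary. If you stay inside HtmlUnit, page.getAnchors() gives you the anchors directly and is likely more robust than a regular expression.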