Tags: java, httpclient, jsoup, html-parsing

Page content couldn't be seen by Jsoup and HttpClient


Hi, I want to scrape information from a website, so I tried to use Jsoup (I also tried HttpClient). I realized that both of them couldn't "see" certain content of the HTML page: when I print out the parsed HTML, I get an empty div like the one below, while other divs print just fine.
Here's my code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class Main {

    public static void main(String[] args) throws IOException {
        String url = "https://chainlinklabs.com/jobs";
        Document doc = Jsoup.connect(url).get();
        System.out.println(doc.getElementsByClass("needed content"));
    }
}

The result in the terminal is:

<div class="needed content"></div> 

I searched for answers on Stack Overflow. Some recommend using the Jackson library: Java - How do I access a child of Div using JSoup

Some recommend embedding a browser in Java: Is there a way to embed a browser in Java?

Some recommend using HtmlUnit: Fail to get full content of page with JSoup

I just tried combining Jsoup with HtmlUnit and got the same result. Here's the code:

String url = "https://chainlinklabs.com/jobs";

try (WebClient wc = new WebClient()) {
    wc.getOptions().setJavaScriptEnabled(true);
    wc.getOptions().setCssEnabled(false);
    wc.getOptions().setThrowExceptionOnScriptError(false);
    wc.getOptions().setTimeout(10000);

    HtmlPage page = wc.getPage(url);
    String pageXml = page.asXml();

    Document doc2 = Jsoup.parse(pageXml, url);
    System.out.println(doc2.getElementsByClass("needed content"));

    System.out.println("Thank God!");
}

My interpretation of the problem is that Jsoup is not showing part of the HTML content because that content is generated by JavaScript. Am I heading in the right direction?
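
A quick way to check this interpretation is to fetch the raw HTML (which executes no JavaScript) and look at what the server actually sends: if the target div is empty in the static source while the page is full of script tags, the content is rendered client-side. A minimal sketch, reusing the URL and the placeholder class name from the question:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticHtmlCheck {

    public static void main(String[] args) throws IOException {
        // Jsoup only downloads and parses the HTML; it never runs JavaScript
        Document doc = Jsoup.connect("https://chainlinklabs.com/jobs").get();

        // Empty output here means the div is filled in later, client-side
        System.out.println("static div text: " + doc.getElementsByClass("needed content").text());

        // Many script tags are a hint that the page renders itself in the browser
        System.out.println("script tags: " + doc.select("script").size());
    }
}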


Solution

  • There is no need (and it is a waste of resources) to re-parse the page from HtmlUnit into jsoup. All the select options are available in HtmlUnit as well (see https://htmlunit.sourceforge.io/gettingStarted.html), and maybe more; an XPath variant is sketched after the example below.

    This simple code works for me. Parts of the page are generated by a JavaScript that starts asynchronously; because of this, you have to wait for those scripts to finish before accessing the page.

    // Imports needed (HtmlUnit 2.x package names):
    // import com.gargoylesoftware.htmlunit.WebClient;
    // import com.gargoylesoftware.htmlunit.html.DomNode;
    // import com.gargoylesoftware.htmlunit.html.DomNodeList;
    // import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public static void main(String[] args) throws IOException {
        String url = "https://chainlinklabs.com/jobs";

        try (final WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(url);
            // the job list is filled in by asynchronous JavaScript - wait for it
            webClient.waitForBackgroundJavaScriptStartingBefore(10_000);

            // System.out.println("--------------------------------");
            // System.out.println(page.asXml());
            // System.out.println("--------------------------------");

            System.out.println("- Jobs -------------------------");
            final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
            for (DomNode domNode : jobTitles) {
                System.out.println(domNode.asNormalizedText());
            }
            System.out.println("--------------------------------");
        }
    }
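
    As a follow-up to the "all the select options are available" point: the same extraction can be done with XPath directly on the HtmlPage, with no jsoup round-trip. A hedged sketch; the XPath expression is an assumption that simply mirrors the .job-title class selector above and has not been verified against the live page:

        // XPath alternative to page.querySelectorAll(".job-title");
        // the expression is an assumption mirroring the CSS selector above
        for (Object node : page.getByXPath("//*[contains(@class, 'job-title')]")) {
            System.out.println(((DomNode) node).asNormalizedText());
        }

        // getFirstByXPath returns a single match, or null if nothing matches
        DomNode first = page.getFirstByXPath("//*[contains(@class, 'job-title')]");
        if (first != null) {
            System.out.println("first title: " + first.asNormalizedText());
        }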