Tags: java, httpurlconnection, urlconnection, httpsurlconnection

Http URLConnection wait for inner request


I am working on a crawling project. I open a simple URLConnection to the website, as shown below:

URLConnection conn = new URL(url).openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));

This returns the HTML body correctly. However, the website makes inner requests for some fields. For example, it fetches the total number of users from a different web service. In a web browser, the total number of users appears after some time, but URLConnection does not wait for it, so the returned HTML does not contain that field.

In Java, is there any way to wait for a while so that URLConnection fetches all the data from the website?


Solution

  • From your "inner requests" comment, it sounds like the website is using JavaScript (via a framework or just native browser APIs) to fetch data and render the results into the DOM. This is very common nowadays, especially with single-page applications (SPAs).

    If that's the case, no amount of waiting will change what a simple HTTP library like URLConnection returns, because it only downloads the response body and never executes scripts. You can check this by saving the HTML locally and opening it in your browser: what happens? When you examine the source, is there JavaScript on that page?

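    To do this properly in code, you'll need something that behaves more like a browser and executes the JavaScript referenced by the HTML in a DOM-like environment. Try Selenium with headless Chrome or Firefox. PhantomJS (driven through GhostDriver) used to be another option, but its development has been suspended, so a headless mainstream browser is the safer choice.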
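
    The "is there JavaScript on that page?" check can also be done from code. Below is a minimal sketch, not a definitive implementation: it downloads the raw body the same way the question does and counts opening <script> tags with a rough regular expression (a heuristic, not an HTML parser). The class name ScriptCheck, the helper names, the UTF-8 charset, and the default URL are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptCheck {

    // Matches an opening <script> tag, case-insensitively.
    // Rough heuristic only; closing tags like </script> are not counted.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    /** Downloads the raw response body as URLConnection sees it (no JavaScript is run). */
    public static String fetch(String url) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        StringBuilder body = new StringBuilder();
        // Assumption: the page is UTF-8 encoded; adjust if the server says otherwise.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    /** Counts opening <script> tags in the given HTML. */
    public static int countScriptTags(String html) {
        Matcher matcher = SCRIPT_TAG.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default URL; pass the page you are crawling as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        String html = fetch(url);
        System.out.println("Script tags in raw HTML: " + countScriptTags(html));
    }
}
```

    If the count is above zero and the field you need is missing from the raw HTML, the value is most likely filled in by a script after the page loads, which is the case where a browser-like tool is needed.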