I'm trying to scrape live data from 50+ dynamic webpages and need the data to be updated every 1-2 seconds. To do so, I have a Timer scheduled every half second that calls the following method once for each of the 50 URLs:
public String fetchData(String link) {
    String data = null;
    try {
        URL url = new URL(link);
        URLConnection urlConn = url.openConnection();
        InputStreamReader inStream = new InputStreamReader(urlConn.getInputStream());
        BufferedReader buff = new BufferedReader(inStream);
        /* code that scrapes the webpage and stores the value in "data" */
        buff.close();
        inStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return data;
}
This method works but takes about a second per URL, or roughly 50 seconds in total. I've also tried JSoup, hoping to overcome the delay with the following code:
public String fetchData(String link, String identifier) {
    Document doc;
    String data = null;
    try {
        doc = Jsoup.connect(link).timeout(10 * 1000).get();
        data = doc.getElementById(identifier).parent().child(0).text();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return data;
}
but have run into approximately the same processing time. Is there a faster way to fetch data from many dynamic webpages simultaneously, whether through URLConnection, JSoup, or some other method?
The short answer is "use threads". Create a thread for each of the 50+ URLs that you want to scrape repeatedly.
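For instance, here is a minimal sketch using a ScheduledExecutorService with one worker per URL. The Scraper class, the urls list, and the fetchData placeholder are assumptions standing in for your existing code:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Scraper {
    // One worker per URL so a slow site doesn't block the others.
    private final ScheduledExecutorService pool;
    private final List<String> urls;

    public Scraper(List<String> urls) {
        this.urls = urls;
        this.pool = Executors.newScheduledThreadPool(urls.size());
    }

    public void start() {
        for (String url : urls) {
            // Re-scrape each URL every 2 seconds, independently of the rest.
            pool.scheduleAtFixedRate(() -> {
                String data = fetchData(url);  // your existing method
                // ... store or process "data" ...
            }, 0, 2, TimeUnit.SECONDS);
        }
    }

    private String fetchData(String url) {
        // placeholder for the scraping code shown in the question
        return null;
    }
}

With this arrangement the total wall-clock time per round is roughly the time of the slowest single fetch, rather than the sum of all fifty.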
It will most likely make little difference whether you use URLConnection, JSoup, or some other way to do the scraping. The actual bottleneck is likely to be due to:

- the load on, and responsiveness of, the sites you are scraping,
- your network bandwidth, and
- the network latency between your machine and those sites.
The first of those is outside of your control (in a positive way!). The last two ... you might be able to address but only by throwing money at the problem. For example, you could pay for a better network connection / path, or pay for alternative hosting to move your scraper close to the sites you are trying to scrape.
Switching to multi-threaded scraping will ameliorate some of those bottlenecks, but not eliminate them.
But I don't think what you are doing is a good idea.
If you write something that repeatedly re-scrapes the same pages once every 1 or 2 seconds, the site operators are going to notice. And they are going to take steps to stop you; steps that will be difficult to deal with, such as:

- rate limiting or throttling your requests,
- blocking your IP address or address range, or
- putting the content behind logins or CAPTCHAs.
And if that doesn't help, maybe more serious things.
The real solution may be to get the information in a more efficient way; e.g. via an API. This may cost you money too. Because (when it boils down to it) your scraping will be costing them money for either no return ... or a negative return if your activity ends up reducing real people's clicks on their site.
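If such an API exists, polling it is typically much cheaper for both sides than re-parsing full HTML pages. As a rough sketch, assuming a hypothetical JSON endpoint and using the standard java.net.http.HttpClient (Java 11+); the endpoint URL and response handling here are illustrative assumptions, since a real API will define its own format and rate limits:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApiPoller {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Fetch one value from a hypothetical JSON endpoint instead of scraping HTML.
    public static String fetchFromApi(String endpoint) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();  // parse the JSON with your library of choice
    }
}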