I have a PHP script that scrapes the web using PhantomJS and inserts the scraped data into a database.
Currently, when a resource times out in PhantomJS, I abort the whole page request and request the page again through PHP.
Here is my code:
page.settings.resourceTimeout = 5000; // 5 seconds
page.onResourceTimeout = function(e) {
    console.log(e.errorCode);   // it'll probably be 408
    console.log(e.errorString); // it'll probably be 'Network timeout on resource'
    console.log(e.url);         // the url whose request timed out
    phantom.exit(1);
};
I only want to resend the request for the resource that timed out, not request the whole page all over again. Is this possible?
You can resend the (GET) request, but this won't help you much, because the context of the request is different.
Resource requests happen automatically when, for example, a JavaScript file is referenced in a <script> tag. You can download it with PhantomJS through XHR, but it is likely that other scripts that depend on it have already tried to run and failed. You would have to re-run all of them, which is really tedious.
Other resources like CSS files or images are not that timing-sensitive and can be re-downloaded. But when you do, you have to insert them in the correct place. Take a CSS file, for example: it would have to be fetched again and re-inserted into the document, which can be done from inside a page.evaluate callback.
XHR requests are explicitly sent through the page, so every request has its own finish/error callback. You can't access those callbacks from outside, so re-running those requests will not work: the actions that are supposed to happen after them won't be called.
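For the CSS case, here is a minimal sketch of what such a page.evaluate callback might look like. The function name refetchStylesheet is made up for illustration, and the sketch assumes the stylesheet was referenced by a <link rel="stylesheet"> element whose href equals the timed-out URL. It uses a synchronous XHR to stay short, and cross-origin stylesheets may additionally require starting PhantomJS with --web-security=false.

```javascript
// Sketch only: intended to be passed to page.evaluate(), so it runs inside
// the page and must not rely on variables from the PhantomJS script.
// `url` is assumed to be the address reported by onResourceTimeout (e.url).
function refetchStylesheet(url) {
    // Find the <link> element whose request timed out.
    var links = document.querySelectorAll('link[rel="stylesheet"]');
    var target = null;
    for (var i = 0; i < links.length; i++) {
        if (links[i].href === url) {
            target = links[i];
            break;
        }
    }
    if (!target) {
        return false; // not referenced by a <link>, nothing to repair
    }

    // Synchronous XHR keeps the sketch short; it blocks the page while loading.
    var xhr = new XMLHttpRequest();
    try {
        xhr.open('GET', url, false);
        xhr.send(null);
    } catch (err) {
        return false; // network error or blocked cross-origin request
    }
    if (xhr.status !== 200) {
        return false;
    }

    // Insert the CSS right after the original <link> so the order of
    // stylesheets (and therefore the cascade) stays the same.
    var style = document.createElement('style');
    style.textContent = xhr.responseText;
    target.parentNode.insertBefore(style, target.nextSibling);
    return true;
}
```

It could be invoked as page.evaluate(refetchStylesheet, url) once the timed-out URL is known. Note that the retry is subject to the same network conditions, so it may simply time out again.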
You may want to run PhantomJS with the --disk-cache=true option, so that it takes less time to run the page request again.
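To decide, per timed-out resource, whether a single re-download is worth attempting or the whole page should be requested again, something like the following could be used. Both helper names are hypothetical, and guessing the resource type from the file extension is only a rough heuristic, not something PhantomJS provides.

```javascript
// Hypothetical helper: guess the resource type from the URL's file extension.
function classifyResource(url) {
    // Drop the fragment and query string before looking at the extension.
    var path = url.split('#')[0].split('?')[0];
    var match = /\.([a-z0-9]+)$/i.exec(path);
    var ext = match ? match[1].toLowerCase() : '';
    if (ext === 'js') { return 'script'; }
    if (ext === 'css') { return 'style'; }
    if (['png', 'jpg', 'jpeg', 'gif', 'svg', 'webp', 'ico'].indexOf(ext) !== -1) {
        return 'image';
    }
    return 'other';
}

// Only stylesheets and images are treated as safe to fetch again on their own;
// scripts and anything unknown fall back to reloading the whole page.
function canRetryAlone(url) {
    var type = classifyResource(url);
    return type === 'style' || type === 'image';
}
```

Inside onResourceTimeout, canRetryAlone(e.url) could then choose between a targeted retry and the existing phantom.exit(1) path.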