Tags: php, timeout, web-scraping, phantomjs

How to resend a request on resource timeout in PhantomJS?


I have a PHP script that uses PhantomJS to scrape the web and inserts the scraped data into a database.
Currently, when a resource times out in PhantomJS, I cancel the whole page request and request the whole page again through PHP.
Here is my code:

var page = require('webpage').create();
page.settings.resourceTimeout = 5000; // 5 seconds
page.onResourceTimeout = function(e) {
  console.log(e.errorCode);   // it'll probably be 408
  console.log(e.errorString); // it'll probably be 'Network timeout on resource'
  console.log(e.url);         // the url whose request timed out
  phantom.exit(1);            // give up; the PHP script then requests the page again
};

I only want to resend the request for the resource that timed out, not request the whole page all over again. Is this possible?


Solution

  • You can resend the (GET) request, but this won't help you much, because your re-request does not serve the purpose the page had when it made the original request.

    Resource requests happen automatically when, for example, a JavaScript file is referenced in a <script> tag. You can download it with PhantomJS through XHR, but it is likely that other scripts that depend on it have already tried to run and failed. You would have to re-run all of them, which is really tedious.
    Other resources, such as CSS files or images, are not that timing-sensitive and can be re-downloaded. But when you do that, you have to insert them into the correct place. Let's take a CSS file as an example.

    1. Detect that it was a CSS resource from the request headers or from the URL,
    2. check in the DOM that the resource is actually referenced,
    3. copy the DOM node with all its attributes (and innerHTML) to a new DOM node,
    4. remove the old one and insert the new one. The markup is unchanged, but the swap should prompt the browser to download the resource again. All of this has to be done in the page context, inside the page.evaluate callback.
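
    The four steps above could be sketched roughly as follows. This is only a sketch under assumptions: reloadStylesheet is a hypothetical helper name, the ".css" check on the URL is an assumed heuristic for step 1, and whether swapping the node really triggers a new download depends on the browser engine.

```javascript
// Runs in the page context. PhantomJS serializes the function passed to
// page.evaluate, so it must not rely on variables from the outer script.
function reloadStylesheet(url) {
  var links = document.querySelectorAll('link[rel="stylesheet"]');
  var replaced = 0;
  for (var i = 0; i < links.length; i++) {
    var oldNode = links[i];
    if (oldNode.href !== url) {
      continue;                             // step 2: not the node that references this url
    }
    var newNode = oldNode.cloneNode(true);  // step 3: copy the node with its attributes
    oldNode.parentNode.replaceChild(newNode, oldNode); // step 4: swap old for new
    replaced++;
  }
  return replaced;                          // number of nodes that were swapped
}

// PhantomJS wiring; skipped when the file is loaded outside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.settings.resourceTimeout = 5000;     // 5 seconds, as in the question
  page.onResourceTimeout = function (e) {
    if (/\.css(\?|$)/.test(e.url)) {        // step 1: assumed heuristic, url looks like CSS
      var count = page.evaluate(reloadStylesheet, e.url);
      console.log('re-inserted ' + count + ' node(s) for ' + e.url);
    }
  };
}
```

    The helper returns the number of swapped nodes, so a return value of 0 means the timed-out URL was not referenced by any stylesheet link in the DOM.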

    XHR requests are sent explicitly by the page's own scripts, so every request has its own finish/error callbacks. You can't access those callbacks from outside, so re-running such a request will not work: the actions that are supposed to happen after the request won't be called.

    You may want to run PhantomJS with the --disk-cache=true option, so that it takes less time to run the page request again.
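
    If you do re-run the whole page request, one option is to retry inside the same PhantomJS process instead of exiting and starting again from PHP. The following is a minimal sketch, not a tested recipe: openWithRetry is a hypothetical helper, the limit of 3 attempts and the example.com URL are placeholders, and it assumes the timeout fires before the open callback runs.

```javascript
// Hypothetical helper: opens `url` and opens it again (up to maxAttempts
// times in total) when the load fails or a resource timed out.
function openWithRetry(page, url, maxAttempts, done) {
  var attempt = 0;
  var timedOut = false;

  page.onResourceTimeout = function (e) {
    timedOut = true;  // only remember it; the decision is made in the open callback
  };

  function tryOpen() {
    attempt++;
    timedOut = false;
    page.open(url, function (status) {
      if (status === 'success' && !timedOut) {
        done(null, attempt);                // clean load
      } else if (attempt < maxAttempts) {
        tryOpen();                          // request the page again
      } else {
        done(new Error('gave up after ' + attempt + ' attempts'), attempt);
      }
    });
  }

  tryOpen();
}

// PhantomJS wiring; skipped when the file is loaded outside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.settings.resourceTimeout = 5000;     // 5 seconds, as in the question
  openWithRetry(page, 'http://example.com/', 3, function (err, attempts) {
    console.log(err ? err.message : 'loaded after ' + attempts + ' attempt(s)');
    phantom.exit(err ? 1 : 0);
  });
}
```

    Started as phantomjs --disk-cache=true retry.js (retry.js being a placeholder file name), later attempts may be able to reuse resources that earlier attempts already downloaded.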