
How do I reliably run a long-running PHP script triggered by a cron job?


I want to crawl 3,000 URLs on a competitor's site to scrape a price value. I don't want to overload their servers or otherwise trigger any firewall, so the first basic measure is to spread these 3,000 requests over the period of a week.

I have code set up like this:

set_time_limit(0);

foreach ($links as $link) {
    // crawl price from link
    // save price to database

    // 200 second delay before next crawl
    sleep(200);
}
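
That works out to 3,000 × 200 s = 600,000 s, just under seven days, which is how the 200 second delay spreads the crawl across the week.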

I'm triggering this script using a cron job that runs at midnight every week.
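
For reference, the crontab entry looks something like this (the day of week and script path are placeholders):

# m h dom mon dow  command
0 0 * * 0 /usr/bin/php /path/to/crawl-prices.php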

The line set_time_limit(0) should override max_execution_time, but I've also read that scripts triggered by a cron job are not subject to the standard execution time limits anyway.

The problem is that for this script to run for a week, I assume the server has to be completely stable with no downtime, otherwise the crawl will fail.

How can I ensure this, and if there is downtime, how can I restart the crawl automatically from the point where it failed? Is a week too long for a script to run? I could condense it to a few hours instead, but even then there is enough time for the script to be interrupted. I'm interested in the scenario where it is interrupted: how do I handle this and still complete the crawl?


Solution

  • Unless you set one in php.ini, there is no max_execution_time limit when using the CLI, which is what you are using when PHP runs from cron.

    Also, if you are crawling each URL only once and the URLs are all different (not, say, 200 of them on the same site), then waiting between each separate crawl is not necessary, as the URLs are not linked or talking to each other.

    If you want to be able to restart the script, all you need to do is write a "where I got to" checkpoint to a file or the database. Then, whenever the script starts, it checks for a restart point and continues from there; a sketch follows below. Remember to delete the restart file/record when the script completes, so the next time you run it the crawl starts from the first URL again.
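
    A minimal sketch of that restart logic, assuming $links is a stable, ordered array of URLs, with hypothetical crawlPrice()/savePrice() helpers standing in for the real crawl and database code:

        <?php
        set_time_limit(0); // harmless on the CLI, where the default is already 0

        $checkpointFile = __DIR__ . '/crawl-checkpoint.txt'; // hypothetical location

        // Resume from the last completed index, or start at 0 on a fresh run.
        $start = is_file($checkpointFile) ? (int) file_get_contents($checkpointFile) : 0;

        $total = count($links);
        for ($i = $start; $i < $total; $i++) {
            $price = crawlPrice($links[$i]); // crawl price from link (hypothetical helper)
            savePrice($links[$i], $price);   // save price to database (hypothetical helper)

            // Record "where I got to" only after the URL is fully processed;
            // at worst an interruption re-crawls a single URL on restart.
            file_put_contents($checkpointFile, (string) ($i + 1));

            sleep(200); // keep the 200 second delay between requests
        }

        // Delete the checkpoint when the crawl completes so the next weekly
        // run starts from the first URL again.
        if (is_file($checkpointFile)) {
            unlink($checkpointFile);
        }

    Writing the checkpoint after the save rather than before means a crash can only repeat one URL, never silently skip one.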