Scaling curl with PHP and cron


I am trying to create a website monitoring webapp using PHP. At the minute I'm using curl to collect headers from different websites and update a MySQL database when a website's status changes (e.g. if a site that was 'up' goes 'down').

I'm using curl_multi (via the Rolling Curl X class, which I've adapted slightly) to process 20 sites in parallel, which seems to give the fastest results, and CURLOPT_NOBODY to make sure only headers are collected. I've also tried to streamline the script to make it as fast as possible.
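For reference, a minimal sketch of the kind of header-only parallel check described above. This is plain curl_multi with one fixed batch, not the Rolling Curl X class or its rolling window. The function name `checkBatch` and the timeout value are assumptions made for illustration.

```php
<?php
/**
 * Request headers only (CURLOPT_NOBODY) for a batch of URLs in parallel.
 * Returns an array of url => HTTP status code; 0 means no response was received.
 * Hypothetical helper: name, signature and timeout are assumptions.
 * Duplicate URLs collapse to one entry because the URL is used as the array key.
 */
function checkBatch(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,  // headers only, no body download
            CURLOPT_RETURNTRANSFER => true,  // do not echo anything to stdout
            CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
            CURLOPT_TIMEOUT        => $timeoutSeconds,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for socket activity
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

Calling `checkBatch()` with 20 URLs at a time would correspond to the batch size mentioned above; a rolling window would instead add a new handle as soon as one finishes.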

It is working OK and I can process 40 sites in approx. 2-4 seconds. My plan has been to run the script via cron every minute, so at the slower end (40 sites per 4 seconds, i.e. 10 per second) I should be able to process about 600 websites per minute. Although this is fine at the minute, it won't be enough in the long term.

So how can I scale this? Is it possible to run multiple cron jobs in parallel, or will this run into bottlenecks?

Off the top of my head I was thinking that I could maybe split the database records into groups of 400 and run a separate script for each group (e.g. ids 1-400, 401-800, 801-1200 etc.), so there would be no danger of database corruption. This way each script would complete within a minute.
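To make the grouping concrete, here is a rough sketch of how the id ranges could be computed. The function name `partitionIds` is hypothetical, and it assumes ids start at 1 and are reasonably dense.

```php
<?php
/**
 * Split ids 1..$maxId into consecutive inclusive ranges of at most $groupSize ids.
 * Hypothetical helper; assumes ids start at 1.
 */
function partitionIds(int $maxId, int $groupSize = 400): array
{
    if ($maxId < 1 || $groupSize < 1) {
        throw new InvalidArgumentException('maxId and groupSize must be positive');
    }

    $ranges = [];
    for ($start = 1; $start <= $maxId; $start += $groupSize) {
        // The last range is clipped so it never runs past $maxId.
        $ranges[] = [$start, min($start + $groupSize - 1, $maxId)];
    }

    return $ranges;
}

// Example: 1000 sites in groups of 400.
foreach (partitionIds(1000) as [$min, $max]) {
    echo "ids {$min}-{$max}", PHP_EOL;
}
```

With 1000 sites this gives three groups (1-400, 401-800, 801-1000), so the last script simply has less work to do.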

However, it feels like this might not work, since the one script running curl_multi seems to max out performance at 20 requests in parallel. So will this work, or is there a better approach?


Solution

  • Yes. The simple solution is to use the same PHP CLI script and pass it two arguments: the minimum and maximum ids of the range of database records (one record per site) that this invocation should process.

    Example crontab entries:
    * * * * * php /user/script.php 1 400
    * * * * * php /user/script.php 401 800

    Or, using a single script, you can use multi-threading (multi-threading in PHP with pthreads, which requires a thread-safe build of PHP). In that case the cron interval should be based on a benchmark of how long the script takes to complete all 800 sites.

    Ref: How can one use multi threading in PHP applications

    Example: if the multi-threaded script completes in 3 minutes, set the cron interval to */3 (every 3 minutes).
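As a rough sketch of the first option, the script could read and validate its two range arguments like this. The function name `parseRange` is hypothetical, and the table and column names in the usage comment are assumptions, not taken from the question.

```php
<?php
/**
 * Read the inclusive id range from the CLI arguments, e.g.
 *   php /user/script.php 401 800
 * Returns [minId, maxId] as integers or throws on bad input.
 * Hypothetical helper; the name and error messages are assumptions.
 */
function parseRange(array $argv): array
{
    if (count($argv) !== 3) {
        throw new InvalidArgumentException('Usage: php script.php <minId> <maxId>');
    }

    $options = ['options' => ['min_range' => 1]];
    $min = filter_var($argv[1], FILTER_VALIDATE_INT, $options);
    $max = filter_var($argv[2], FILTER_VALIDATE_INT, $options);

    if ($min === false || $max === false || $min > $max) {
        throw new InvalidArgumentException('minId and maxId must be positive integers with minId <= maxId');
    }

    return [$min, $max];
}

// Usage inside the cron script (table and column names are hypothetical):
//   [$min, $max] = parseRange($argv);
//   $stmt = $pdo->prepare('SELECT id, url FROM sites WHERE id BETWEEN ? AND ?');
//   $stmt->execute([$min, $max]);
```

Because each cron entry only touches its own id range, the parallel scripts never update the same rows.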