Search code examples
phpiiscronscheduled-taskssqlsrv

Efficient way to test dead 10,000+ URLs stored on database using PHP/MS SQL/Task Scheduler


I have a database table which holds 10,000+ entries with URLs. I am testing to see whether URLs are active or dead using get_headers() PHP function and I have set a task which runs every hour (entries are tested in chuncks of 500 approx.)

It works fine but, it times out because, it has to enquire lots of entries at once.

Is there a better way of doing this and reduce server load at the same time? I also tried CURL but, it has the same timeout issue.

Thanks


Solution

  • According to this comment on PHP's documentation it may be better to use cURL.

    • I would store in the DB a timestamp of the last URL check for each entry.
    • Create a cron job that triggers a PHP script every minute or every x minutes.

    The PHP cron task:

    • I would query the database to get all the links with a timestamp older than one day (or x hours), sorted by older first and a limit of let's say 20 links or more.
    • Loop over these links and check each link with cURL:
      • Set correct cURL CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT timeouts so that the check can be done but without exceeding a long time also. Perhaps 5 to 10 seconds?
      • Check the URL and update the timestamp once it's done.
      • See what getrusage() returns you to see if you are soon arriving at the ini_get('max_execution_time') value. Or create your own timer with microtime(). But the idea is to break the loop if needed to avoid the script to be suddenly killed.

    You could then try to improve speed by using curl_multi_*() routines. I just didn't find a nice solution where you read the result of each finished handle in order to update the DB. If you have to wait for all requests to finish then why not making some packages of URL requests and run them all. This is what I tried quickly, but by reading the result (which would not be your case):

    <?php
    
    define('TIMEOUT', 8);
    
    $urls = [
        'https://www.php.net/manual/fr/function.curl-multi-select.php',
        'https://www.bag.admin.ch/bag/fr/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/testen.html',
        'https://ofsp-coronavirus.ch/tester/',
        'https://www.liguepulmonaire.ch/nc/fr/maladies-et-consequences/bpco/test-de-risque-bpco.html',
        'https://www.ge.ch/covid-19-se-faire-tester/faire-test-covid-19',
        'https://www.gsi.be.ch/fr/start/themen/coronavirus/testen/wo-kann-ich-einen-covid-19-test-machen-.html',
        'https://diagnostics.roche.com/ch/fr/products/params/sars-cov-2-rapid-antigen-nasal-test.html',
        'https://www.unispital-basel.ch/fr/patients-proches/votre-sejour/massnahmen-coronavirus/test-coronavirus/',
        'https://www.insel.ch/fr/patients-et-visiteurs/coronavirus/test-du-covid-dans-linsel-gruppe',
        'https://www.bag.admin.ch/bag/en/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/testen.html',
        'https://www.fr.ch/sante/covid-19/test-covid-19-liste-des-pharmacies-autorisees',
        'https://www.fr.ch/document/425951',
        'https://www.m3-test.ch/',
        'https://www.ehnv.ch/covid-19',
        'https://covid-19testing.genepredictis.com/',
        'https://www.vd.ch/themes/mobilite/automobile/test-schuhfried/',
        'https://www.hirslanden.ch/fr/corporate/campagne/covid-19-test/tests-repetitifs-preventifs-entreprises-ecoles.html',
    ];
    
    $multi_handle = curl_multi_init();
    
    $handles = [];
    
    foreach ($urls as $i => $url) {
        $handles[$i] = curl_init();
        curl_setopt_array($handles[$i], [
            CURLOPT_URL => $url,
            CURLOPT_HEADER => false,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => TIMEOUT,
            CURLOPT_TIMEOUT => TIMEOUT,
        ]);
        curl_multi_add_handle($multi_handle, $handles[$i]);
    }
    
    $running = null;
    
    do {
        curl_multi_exec($multi_handle, $running);
        usleep(1000);
    } while ($running > 0);
    
    $results = [];
    
    foreach ($handles as $i => $handle) {
        $results[$i] = curl_multi_getcontent($handle);
        curl_multi_remove_handle($multi_handle, $handle);
    }
    
    curl_multi_close($multi_handle);
    
    var_export($results);