php regex wordpress podscms

Page scraping and regex give no results when they should


I have this little WordPress-based script that scrapes a web page and counts the occurrences of four keywords using preg_match_all().

This is the code for a URL that I know contains the keywords:

<?php

$url = 'http://www.leggioggi.it/2013/08/16/i-tre-amici-discutono-di-servizio-sanitario-casuale-e-differenze-nord-sud/';

$response = wp_remote_get($url);

$the_body = wp_remote_retrieve_body($response);
//echo htmlentities($the_body);

$matches = array();

$matches_count = preg_match_all("/gravidanz|preconcezional|prenatal|concepimento/i", $the_body, $matches);

var_dump($matches_count);
var_dump($matches);
?>

I'm having some odd problems. On some pages I get zero matches, even though I know that those pages contain the keywords. I noticed that for those pages, uncommenting the line echo htmlentities($the_body); solves the problem. If I comment it out again, the oddity is back.

My guess is that some caching mechanism is involved.

PS: the code is not written in a template file but in a Pods framework page.

UPDATE: I put a var_dump($the_body); after the htmlentities line, and the behavior is interesting. If echo htmlentities($the_body); is commented out, var_dump($the_body); prints an empty string; if that line is active, var_dump($the_body); prints the whole page HTML. So I really don't get what's going on!

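SOLVED: I checked the $response variable (my bad for not thinking of it sooner) and discovered that the request itself had failed: instead of a response, wp_remote_get() returned a WP_Error describing a timeout. wp_remote_retrieve_body() returns an empty string when it is given a WP_Error, which is why there was nothing to match against. This is what I get back: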

object(WP_Error)#30 (2) {
  ["errors"]=>
  array(1) {
    ["http_request_failed"]=>
    array(1) {
      [0]=>
      string(69) "Operation timed out after 5000 milliseconds with 25692 bytes received"
    }
  }
  ["error_data"]=>
  array(0) {
  }
}
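
To make the check concrete, here is a minimal sketch of the same keyword count that tests the response before touching the body. The function name count_keywords_at() is hypothetical (it is not part of WordPress or of my original script); wp_remote_get(), is_wp_error() and wp_remote_retrieve_body() are the WordPress functions, so this is meant to run inside WordPress.

```php
<?php
/**
 * Hypothetical helper: returns the number of keyword matches in the page
 * at $url, or the WP_Error from wp_remote_get() if the request failed.
 */
function count_keywords_at( $url, $pattern ) {
    $response = wp_remote_get( $url );

    // On a transport failure (timeout, DNS, ...) $response is a WP_Error,
    // and wp_remote_retrieve_body() would return an empty string for it.
    if ( is_wp_error( $response ) ) {
        return $response;
    }

    $the_body = wp_remote_retrieve_body( $response );

    // preg_match_all() returns the number of full-pattern matches.
    return preg_match_all( $pattern, $the_body, $matches );
}
```

The caller can then test the result with is_wp_error() and print $result->get_error_message() instead of silently reporting zero matches.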

Solution

  • I checked the $response variable (my bad for not thinking of it sooner) and discovered that the request itself had failed: instead of a response, wp_remote_get() returned a WP_Error describing a timeout. This is what I get back:

    object(WP_Error)#30 (2) {
      ["errors"]=>
      array(1) {
        ["http_request_failed"]=>
        array(1) {
          [0]=>
          string(69) "Operation timed out after 5000 milliseconds with 25692 bytes received"
        }
      }
      ["error_data"]=>
      array(0) {
      }
    }
    

    So it is solved. I just have to check for the HTTP error, repeat the request a limited number of times, and skip the resource if no valid response comes back.
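
The retry idea could look roughly like this. It is only a sketch under some assumptions: the function name fetch_body_with_retries(), the limit of 3 attempts, and the 15-second timeout are values I picked for illustration, and treating only HTTP 200 as a valid response is my own choice. The 'timeout' argument of wp_remote_get() is in seconds and defaults to 5, which matches the 5000 milliseconds in the error above.

```php
<?php
/**
 * Hypothetical helper: fetch $url, retrying up to $max_attempts times.
 * Returns the body string on success, or null if every attempt failed.
 */
function fetch_body_with_retries( $url, $max_attempts = 3, $timeout = 15 ) {
    for ( $attempt = 1; $attempt <= $max_attempts; $attempt++ ) {
        // Raise the timeout above the 5-second default (assumed value).
        $response = wp_remote_get( $url, array( 'timeout' => $timeout ) );

        // Accept the response only if the request succeeded with HTTP 200.
        if ( ! is_wp_error( $response )
            && 200 === (int) wp_remote_retrieve_response_code( $response ) ) {
            return wp_remote_retrieve_body( $response );
        }
    }

    // Give up on this resource after the last failed attempt.
    return null;
}
```

A null return means the resource should be ignored; any string can be passed on to preg_match_all() as before.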