I have a small WordPress-based script that scrapes a web page and counts the occurrences of 4 keywords using preg_match_all().
This is the code for a URL that I know contains the keywords:
<?php
$url ='http://www.leggioggi.it/2013/08/16/i-tre-amici-discutono-di-servizio-sanitario-casuale-e-differenze-nord-sud/';
$response = wp_remote_get($url);
$the_body = wp_remote_retrieve_body($response);
//echo htmlentities($the_body);
$matches = array();
$matches_count = preg_match_all("/gravidanz|preconcezional|prenatal|concepimento/i", $the_body, $matches);
var_dump ($matches_count);
var_dump ($matches);
?>
I'm having some odd problems. On some pages I get zero matches, even though I know that those pages contain the keywords. I noticed that for those pages, uncommenting the line echo htmlentities($the_body);
solves the problem. If I comment it out again, the oddity is back.
My guess is that some caching mechanism is involved.
PS: the code is not written in a template file but in a Pods Framework page.
UPDATE:
I put a var_dump($the_body);
after the htmlentities line. The behavior is interesting. If echo htmlentities($the_body);
is commented out, var_dump($the_body); prints an empty string; if the same line is active, var_dump($the_body); prints the whole page HTML. So I really don't get what's going on!
SOLVED: I checked the $response var (my bad for not thinking of it sooner) and discovered that the request had indeed failed: the error was reported in the value returned by wp_remote_get(). This is what I get back:
object(WP_Error)#30 (2) {
  ["errors"]=>
  array(1) {
    ["http_request_failed"]=>
    array(1) {
      [0]=>
      string(69) "Operation timed out after 5000 milliseconds with 25692 bytes received"
    }
  }
  ["error_data"]=>
  array(0) {
  }
}
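For anyone hitting the same thing, here is a minimal sketch of how the error could be surfaced before reading the body. The helper name describe_fetch_error is hypothetical (not part of my script); is_wp_error() and WP_Error::get_error_message() are the standard WordPress calls. As far as I can tell, wp_remote_retrieve_body() returns an empty string when it is handed a WP_Error, which would match the empty var_dump above.

```php
<?php
/**
 * Hypothetical helper (the name is an assumption, not from the original script).
 * Returns null when the response is usable, or a readable error message
 * when wp_remote_get() handed back a WP_Error instead of a response array.
 */
function describe_fetch_error( $response ) {
    if ( is_wp_error( $response ) ) {
        // get_error_message() returns the first message stored in the WP_Error.
        return $response->get_error_message();
    }
    return null;
}
```

Calling describe_fetch_error( $response ) right after wp_remote_get() and logging any non-null result should make a timeout like the one above visible immediately, instead of showing up as zero matches.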
So it is solved. I'll just have to check for the HTTP error, repeat the request a limited number of times, and ignore the resource if a correct response is not returned.
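A rough sketch of what that retry could look like, under a few assumptions: the function name fetch_body_with_retries, the limit of 3 attempts and the 15-second timeout are all made up for illustration. The 'timeout' argument of wp_remote_get() is real; its default is 5 seconds, which lines up with the "5000 milliseconds" in the error message.

```php
<?php
/**
 * Fetch a page body with wp_remote_get(), retrying a limited number of times.
 * Hypothetical helper: the name, the attempt limit and the timeout value
 * are assumptions, not part of the original script.
 *
 * @return string|false The body on success, false if every attempt failed.
 */
function fetch_body_with_retries( $url, $max_attempts = 3, $timeout = 15 ) {
    for ( $attempt = 1; $attempt <= $max_attempts; $attempt++ ) {
        // Raise the timeout above the 5-second default.
        $response = wp_remote_get( $url, array( 'timeout' => $timeout ) );

        if ( is_wp_error( $response ) ) {
            continue; // Transport failure (e.g. a timeout): try again.
        }
        if ( 200 !== (int) wp_remote_retrieve_response_code( $response ) ) {
            continue; // The server answered, but not with 200 OK: try again.
        }
        return wp_remote_retrieve_body( $response );
    }
    return false; // Give up and let the caller ignore this resource.
}
```

The caller would then run preg_match_all() only when the return value is not false, and skip the URL otherwise. Raising the timeout alone might already be enough for slow pages; the retry loop just covers the occasional failure.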