I am testing a PHP script to scrape a remote site using the Simple HTML DOM Parser library. The code used to work fine; however, it suddenly stopped today.
<?php
require_once 'backend/connector.php';
require_once 'table_access/simplehtmldom_1_5/simple_html_dom.php';
ini_set("display_errors", 1);
error_reporting(E_ALL);
echo file_get_html("http://www.google.com");
?>
The error it's giving is:
Warning: file_get_contents(http://www.google.com): failed to open stream: Connection timed out in /home/peppyoil/public_html/sandboxassets/engines/table_access/simplehtmldom_1_5/simple_html_dom.php on line 75
I don't understand why it keeps timing out when the remote site is clearly available from a browser. I would understand if it said "connection refused" or something similar, but what could possibly explain the timeout?
I tried using cURL:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.google.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, $proxy); // $proxy is ip of proxy server
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$response = curl_exec($ch);
// Read the status code only after curl_exec(); before the request runs it is always 0.
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); // this results 0 every time
if ($response === false) $response = curl_error($ch);
echo stripslashes($response);
curl_close($ch);
?>
That didn't work either; this time it threw the following error instead:
Connection timed out after 10001 milliseconds
The test script given above is sitting at http://www.peppyburro.com/sandboxassets/engines/test1.php
Update 2: Just checked my port 80 and found this:
Outbound Port 80, 443, 587 and 465 for your account are BLOCKED Reason for the port block: During our regular scans, we have found malicious files in your account which may be infected with malware.
Could this have anything to do with the timeouts?
Outbound Port 80, 443, 587 and 465 for your account are BLOCKED Reason for the port block: During our regular scans, we have found malicious files in your account which may be infected with malware.
As the message above already states, your hosting provider has flagged your site's content as malicious.
This is because what you are trying to do resembles a proxy server and falls under the category of URL-rewriting sites. You therefore cannot host that script, because it can be used to reach content that is blocked in your region but not in your hosting provider's region.
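To confirm that the outbound block is what causes the timeouts, you could probe the ports directly from the hosted account. The following is only a minimal sketch: `probe_outbound_port()` and `describe_probe()` are hypothetical helper names, and the 5-second timeout is an arbitrary choice. A firewall that silently drops outbound packets typically shows up as "Connection timed out" rather than "Connection refused", which would match the errors in the question.

```php
<?php
// Hypothetical helper: try to open a plain TCP connection and report the outcome.
function probe_outbound_port(string $host, int $port, float $timeout = 5.0): array
{
    $errno = 0;
    $errstr = '';
    // @ suppresses the PHP warning; the failure reason is captured in $errstr instead.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket !== false) {
        fclose($socket);
        return ['port' => $port, 'open' => true, 'error' => ''];
    }
    return ['port' => $port, 'open' => false, 'error' => $errstr];
}

// Hypothetical helper: turn a probe result into a one-line summary.
function describe_probe(array $result): string
{
    if ($result['open']) {
        return sprintf('port %d: reachable', $result['port']);
    }
    $reason = $result['error'] !== '' ? $result['error'] : 'unknown error';
    return sprintf('port %d: FAILED (%s)', $result['port'], $reason);
}
```

For example, running `echo describe_probe(probe_outbound_port('www.google.com', 80));` on the server and again for port 443 should show whether those ports are reachable. If both fail with a timeout while the same site loads in your browser, the block is on the host's side, and only the hosting provider can lift it once the flagged files are cleaned up.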
Hope this helps.