I have been running my Simple HTML DOM script on a variety of pages for weeks without any issues. Today, however, when I try:
$html = file_get_html('http://www.sony.co.za/product/dsc-wx10');
I get:
( ! ) Warning: file_get_contents(http://www.sony.co.za/product/dsc-wx10)
[function.file-get-contents]: failed to open stream: HTTP request failed!
in C:\XXXXXXX\simplephpdom\simple_html_dom.php on line 70
What could cause the call above to fail when the following work fine:
$html = file_get_html('http://www.google.com');
$html = file_get_html('http://www.whatever.com');
I am able to access the Sony page via my browser. As far as I understand, the code above connects to port 80, just like my browser does, so I find it hard to believe I'm being blocked. And also, I was blocked from day one.
Any ideas what could be causing this?
The site seems to delay requests containing the PHP user agent forever. Sounds like a really, really lame attempt to stop crawlers.
The solution is simple: use cURL to send the request and specify a "normal" User-Agent.
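A minimal sketch of that approach, assuming the cURL extension is enabled and that a browser-like User-Agent is all the server checks for. The helper names build_curl_options and fetch_page, the User-Agent string, and the timeout values are made up for illustration; str_get_html is Simple HTML DOM's parser for an HTML string.

```php
<?php
// Hypothetical helper: cURL options for a request that looks like a browser.
function build_curl_options(string $userAgent): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects such as the 301
        CURLOPT_USERAGENT      => $userAgent, // the header the site appears to check
        CURLOPT_CONNECTTIMEOUT => 10,         // assumed limits so a stalled request gives up
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Hypothetical helper: fetch a URL and return the body, or null on failure.
function fetch_page(string $url, string $userAgent): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, build_curl_options($userAgent));
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (not run here because it needs network access and simple_html_dom.php):
// require 'simple_html_dom.php';
// $page = fetch_page('http://www.sony.co.za/product/dsc-wx10',
//                    'Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0');
// $html = ($page !== null) ? str_get_html($page) : null;
```

The timeouts are there because the site seems to hang rather than refuse the connection, so without them a rejected request could block the script for a long time.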
Update: Apparently it also blocks empty/missing user agents:
> nc www.sony.co.za 80
nc: using stream socket
GET / HTTP/1.0
Host: www.sony.co.za
User-Agent: Mozilla Firefox

HTTP/1.0 301 Moved Permanently
...

vs

> nc www.sony.co.za 80
nc: using stream socket
GET / HTTP/1.0
Host: www.sony.co.za

[no response]
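Since the transcript suggests that a missing User-Agent header is what gets the request ignored, it may also be enough to keep using PHP's HTTP stream wrapper and just set the header. This is only a sketch under that assumption; make_http_context is a made-up helper name, and the User-Agent string and timeout are placeholders.

```php
<?php
// Hypothetical helper: a stream context that sends a User-Agent header
// and gives up after a timeout instead of waiting indefinitely.
function make_http_context(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,      // sent as the User-Agent header
            'timeout'    => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

// Usage (not run here because it needs network access and simple_html_dom.php):
// require 'simple_html_dom.php';
// $context = make_http_context('Mozilla/5.0 (compatible; ExampleFetcher/1.0)');
// $body = file_get_contents('http://www.sony.co.za/product/dsc-wx10', false, $context);
// $html = ($body !== false) ? str_get_html($body) : null;
//
// Alternatively, set the default for every HTTP stream request, which should
// also cover the file_get_contents() call that file_get_html() makes internally:
// ini_set('user_agent', 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)');
```

Whether the site accepts an arbitrary User-Agent string or only browser-like ones is not clear from the transcript, so the value may need adjusting.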