Simple DOM and Sony.co.za?


I have been running my Simple HTML DOM script against a variety of pages for weeks without any issues. Today, however, when I try:

$html = file_get_html('http://www.sony.co.za/product/dsc-wx10');

I get:

( ! ) Warning: file_get_contents(http://www.sony.co.za/product/dsc-wx10) 
[function.file-get-contents]: failed to open stream: HTTP request failed!
 in C:\XXXXXXX\simplephpdom\simple_html_dom.php on line 70

Why would the call above fail when the following calls work?

 $html = file_get_html('http://www.google.com');
 $html = file_get_html('http://www.whatever.com');

I can open the Sony page in my browser, and as far as I understand, the code above connects to port 80 just as the browser does, so I find it hard to believe I'm being blocked. Besides, the request failed from the very first attempt.

Any ideas what could be causing this?


Solution

  • The site seems to stall requests that carry the PHP user agent indefinitely. Sounds like a really, really lame attempt to stop crawlers.

    The solution is simple: use curl to send the request and specify a "normal" user agent.


    Update: Apparently it also blocks empty/missing user agents:

    > nc www.sony.co.za 80
    nc: using stream socket
    GET / HTTP/1.0
    Host: www.sony.co.za
    User-Agent: Mozilla Firefox
    
    HTTP/1.0 301 Moved Permanently
    ...
    

    versus the same request without a User-Agent header:

    > nc www.sony.co.za 80
    nc: using stream socket
    GET / HTTP/1.0
    Host: www.sony.co.za
    [no response]
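
The curl approach suggested in the answer could look roughly like the sketch below. It assumes the PHP curl extension is installed. The function names `browser_like_curl_options` and `fetch_with_user_agent`, the 15-second timeout, and the example user agent string are illustrative choices, not part of Simple HTML DOM or of the original answer.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): builds the cURL options
// for a request that identifies itself with an explicit User-Agent.
function browser_like_curl_options(string $userAgent, int $timeoutSeconds = 15): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects such as the 301 seen with nc
        CURLOPT_USERAGENT      => $userAgent,      // the header the site appears to require
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // assumed limit, so a stalled server cannot hang the script
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ];
}

// Hypothetical helper: fetches a URL and returns the response body as a string.
function fetch_with_user_agent(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, browser_like_curl_options($userAgent));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}
```

With `simple_html_dom.php` loaded, the returned string could then be handed to `str_get_html()` instead of calling `file_get_html()` on the URL, for example `str_get_html(fetch_with_user_agent('http://www.sony.co.za/product/dsc-wx10', 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'))`. Whether the site accepts that particular user agent string is untested here.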