
file_get_contents returns question-mark diamond characters (�)


I wrote a crawler using the file_get_contents function, but when I crawl some sites I get the character � where I should get é. Any ideas about what is happening?

This is on a Windows VPS server running PHP.

I've already tried:

But none of these things worked.

PS: The file I'm running this code from is saved as UTF-8.

    $url = "https://play.google.com/books/reader?id=4rqYDwAAQBAJ&hl=en_US";
    $options = array('http' => array('method' => "GET", 'header' => "Accept-language: en-US,en;q=0.8\r\n" . "Accept-Charset: UTF-8, *;q=0"));
    $context = stream_context_create($options);
    $profile = file_get_contents($url, false, $context);
    echo $profile;

I'm expecting to get accented characters and not this diamond character �.


Solution

  • Google is ignoring your Accept-Charset header because you're not specifying a User-Agent; I have no idea why. It took me an hour to figure out. Adjust your options as follows:

    $options = [
        "http" => [
            "method" => "GET",
            // Each header line ends with a real CRLF ("\r\n"); "\\r\n" would send a literal backslash.
            "header" => "Accept-language: en-US,en;q=0.8\r\n" .
                        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0\r\n" .
                        "Accept-Charset: UTF-8, *;q=0"
        ]
    ];
    

    Adding the "User-Agent" header seems to do the trick. Without it, Google is probably returning a different encoding.
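
If the server still answers in another encoding, one way to avoid the � characters is to read the charset from the response's Content-Type header and convert the body to UTF-8 yourself. The following is only a rough sketch under a few assumptions: the helper names charset_from_headers, to_utf8 and fetch_as_utf8 are made up for illustration, the server is assumed to declare its charset in the Content-Type header (many do, but not all), and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical helper: extract the charset from raw response header lines.
// $headers has the same shape as $http_response_header (one header line per entry).
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // After redirects there may be several Content-Type lines; keep the last one.
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            $charset = $m[1];
        }
    }
    return $charset;
}

// Hypothetical helper: convert $body to UTF-8 when a different charset was declared.
function to_utf8(string $body, ?string $charset): string
{
    if ($charset === null || strcasecmp($charset, 'UTF-8') === 0) {
        return $body; // nothing declared, or already UTF-8
    }
    $converted = @iconv($charset, 'UTF-8', $body);
    // Fall back to the raw body if iconv does not know the charset.
    return $converted === false ? $body : $converted;
}

// Hypothetical wrapper: fetch a page with the headers from the answer, then normalize it.
function fetch_as_utf8(string $url): string
{
    $options = [
        "http" => [
            "method" => "GET",
            "header" => "Accept-language: en-US,en;q=0.8\r\n" .
                        "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0\r\n" .
                        "Accept-Charset: UTF-8, *;q=0"
        ]
    ];
    $body = file_get_contents($url, false, stream_context_create($options));
    if ($body === false) {
        throw new RuntimeException("Request failed: $url");
    }
    // The HTTP wrapper fills $http_response_header in the local scope.
    return to_utf8($body, charset_from_headers($http_response_header ?? []));
}
```

With this in place, echo fetch_as_utf8($url); would replace the file_get_contents and echo lines from the question. Pages that declare their encoding only in a meta tag are not covered by this sketch.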