php, rss, simplexml

simplexml_load_file and simplexml_load_string return the same data with different encoding


Trying to parse a UTF-8 RSS feed from a URL, I first tried this:

$flux = simplexml_load_file("https://mamanslouves.com/feed");
foreach($flux->channel->item as $Item){
    $title = $Item->title;
    echo $title;
}

This code works, but titles containing accented characters (éèà) come out garbled, as if they had been converted to another charset. The following code appears to fix the problem:

$raw = file_get_contents("https://mamanslouves.com/feed");
$flux = simplexml_load_string($raw);
foreach($flux->channel->item as $Item){...}

I would like to understand why.


Solution

  • Going by the discussion I had with MarkusZeller in the comments, I think the answer has two components.

    First we need to look at the URL you're using. It is not the URL of the file you eventually download. A look in the network tab of the browser developer tools shows this:

    There are two permanent redirects (301) before the RSS feed itself is downloaded. Everything about the feed is UTF-8: the XML, and the file you download. The only thing that isn't UTF-8 is the first redirect, whose response is declared as iso-8859-1. You can see this by inspecting the response headers in the network tab.
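The same inspection can be done from PHP. The sketch below assumes a flat list of header lines covering every response in the chain, which is the shape get_headers() returns when it follows redirects. The helper name charsets_per_response() and the sample header lines are hypothetical: they are modelled on the chain described above, not captured from the real site.

```php
<?php
// Hypothetical helper: split a flat header list into one entry per response
// and report the charset (if any) declared in each Content-Type header.
function charsets_per_response(array $headerLines): array
{
    $result = [];
    $current = -1;
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // A status line starts a new response in the redirect chain.
            $current++;
            $result[$current] = ['status' => (int) $m[1], 'charset' => null];
        } elseif ($current >= 0
            && preg_match('/^Content-Type:.*charset="?([^\s;"]+)/i', $line, $m)) {
            $result[$current]['charset'] = strtolower($m[1]);
        }
    }
    return $result;
}

// Illustrative data only (an assumption), shaped like the chain described above.
$sample = [
    'HTTP/1.1 301 Moved Permanently',
    'Content-Type: text/html; charset=iso-8859-1',
    'Location: https://example.org/step-2',
    'HTTP/1.1 301 Moved Permanently',
    'Location: https://example.org/feed/',
    'HTTP/1.1 200 OK',
    'Content-Type: application/rss+xml; charset=UTF-8',
];

foreach (charsets_per_response($sample) as $i => $r) {
    printf("#%d %d %s\n", $i, $r['status'], $r['charset'] ?? '(none)');
}
```

To check a real URL, pass the result of get_headers($url) instead of $sample.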

    Then we need to consider what simplexml_load_file() does. It needs to figure out the encoding of the file it downloads, and there are several places it could get that from: the HTTP headers of the redirects, the HTTP headers of the feed, or the XML content. It evidently uses the first charset it encounters: the one in the HTTP header of the first redirect, which says iso-8859-1. So what is really UTF-8 is read as iso-8859-1, and everything goes wrong from there. The misread characters are then converted to UTF-8, which yields garbled text instead of the accented letters, as you saw.
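The effect of that misreading can be reproduced without any network access. This is a minimal sketch, assuming the mbstring extension is available. The helper misread_as_latin1() is a hypothetical name, and it only imitates the wrong-charset conversion described above rather than calling the parser itself.

```php
<?php
// "é" as stored in a UTF-8 feed: two bytes, 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Hypothetical helper: treat the bytes as ISO-8859-1 and convert them to UTF-8,
// which is what happens when a reader is told the wrong source charset.
function misread_as_latin1(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$garbled = misread_as_latin1($utf8);

echo bin2hex($utf8), "\n";    // c3a9
echo bin2hex($garbled), "\n"; // c383c2a9, which displays as "Ã©"
```

Each original byte is taken for a separate Latin-1 character and re-encoded, so one accented letter becomes two unrelated characters.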

    To prove that it is the wrong charset in the first redirect that messes things up, you can request the feed at its final URL, without the redirects:

    $flux = simplexml_load_file("https://mamanslouves.org/feed/");
    foreach($flux->channel->item as $Item){
        $title = $Item->title;
        echo $title;
    }
    

    And this does return normal accented letters.

    Going through file_get_contents() works because that function doesn't care about the charset: it just gives you the raw bytes, which are then interpreted as UTF-8 when the string is parsed. Exactly as Markus said.
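The string-based path can be sketched in isolation. The inline $raw value below is a hypothetical stand-in for the bytes file_get_contents() would return from the feed URL. It is an assumption made so the example runs offline, and it is not the real feed content.

```php
<?php
// Hypothetical feed content standing in for file_get_contents($url).
// The title holds "été" as raw UTF-8 bytes.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<rss version=\"2.0\"><channel><item>"
     . "<title>\xC3\xA9t\xC3\xA9</title>"
     . "</item></channel></rss>";

// No HTTP headers are involved here: the parser only sees the bytes.
$flux = simplexml_load_string($raw);
if ($flux === false) {
    throw new RuntimeException('Could not parse the feed');
}

$titles = [];
foreach ($flux->channel->item as $Item) {
    $titles[] = (string) $Item->title;
}

echo implode("\n", $titles), "\n";
```

With the real feed, replace the inline string with file_get_contents() on the feed URL and check for a false return before parsing.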