
How to convert this UTF-8 escaped string from an Amazon MWS response to proper UTF-8?


In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:

<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>

The name is supposed to be Ramírez. The character í is Unicode code point U+00ED, which UTF-8 encodes as the two bytes \xc3\xad (see this chart for reference).

However, PHP's SimpleXML mangles this string, transforming it into

Ramírez Jones

You can see the same result here because I simply pasted the original string into the editor box (evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).

Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes

RamÃ-­rez Jones

For some reason a hyphen is inserted there, although, believe it or not, if you select the above bold text and paste it back into a Stack Overflow editor window, it will simply appear as Ramírez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
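The hyphen is consistent with how XML defines numeric character references: each one names a code point rather than a byte, so &#xAD; is U+00AD, the soft hyphen, which many renderers hide. A small sketch that dumps the bytes SimpleXML returns for the element above (the variable names are illustrative only):

```php
<?php
// Parse the element from the question and look at the raw bytes of its text.
$xml  = '<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$name = (string) (new SimpleXMLElement($xml))->Name;

// U+00C3 is stored as C3 83 and U+00AD (soft hyphen) as C2 AD.
$parsedHex = bin2hex($name);
echo $parsedHex, "\n";   // 52616dc383c2ad72657a204a6f6e6573

// The intended text has a single í, stored as the two bytes C3 AD.
$intendedHex = bin2hex("Ram\xC3\xADrez Jones");
echo $intendedHex, "\n"; // 52616dc3ad72657a204a6f6e6573
```

The extra `83 c2` in the first dump shows that the two references were treated as two separate characters, each then encoded as UTF-8.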

Here is some example code to show this problem:

$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());

Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:

UTF-8
Ramírez Jones
RamA-rez Jones

How can we avoid this problem? It's really screwing things up.

EDIT:

Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off the í).

REVISED FINAL SOLUTION:

It differs from the accepted answer below, but this was the most elegant solution that I found. Just replace the last line of the example with this:

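echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($xml, ENT_QUOTES, 'ISO-8859-1'));

This works because "&#xC3;&#xAD;" are numeric character references, which html_entity_decode() also decodes. With ISO-8859-1 as the target encoding (the default before PHP 5.4), each reference becomes a single byte, and the bytes 0xC3 0xAD together are the UTF-8 encoding of í.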
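Whether this trick yields the intended bytes depends on the target encoding that html_entity_decode() uses. A minimal sketch comparing the two cases, with the encoding passed explicitly so the result does not depend on the PHP version or configuration:

```php
<?php
$refs = '&#xC3;&#xAD;';

// Target ISO-8859-1: each reference becomes one byte, giving C3 AD (UTF-8 "í").
$latin1 = html_entity_decode($refs, ENT_QUOTES, 'ISO-8859-1');
echo bin2hex($latin1), "\n"; // c3ad

// Target UTF-8: each reference becomes its own UTF-8 character (Ã, soft hyphen).
$utf8 = html_entity_decode($refs, ENT_QUOTES, 'UTF-8');
echo bin2hex($utf8), "\n";   // c383c2ad
```

Note that html_entity_decode() also decodes references such as &lt; and &amp;, so the decoded string may no longer be well-formed XML if the payload contains them.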

ALTERNATE SOLUTION

Strangely, this also works:

$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name; 

Solution

  • SimpleXML does not decode the hex entities into bytes and then read the result as UTF-8, because that's not how XML or UTF-8 actually works: in XML, a reference such as &#xC3; names the Unicode code point U+00C3 (Ã), not the byte 0xC3. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.

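    // Turn each hex reference in the range 80-FF back into the raw byte it was
    // meant to be. ASCII references such as &#x26; stay escaped so the XML
    // remains well-formed, and longer references are left for the parser.
    function decode_hexentities($xml) {
      return
        preg_replace_callback(
          '~&#x([89a-f][0-9a-f]);~i',
          function ($matches) { return chr(hexdec($matches[1])); },
          $xml
        );
    }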
    
    $xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
    $xml = decode_hexentities($xml);
    $elem = new SimpleXMLElement($xml);
    $bad_string = $elem->Name;
    echo mb_detect_encoding($bad_string)."\n";
    echo $elem->Name->__toString()."\n";
    echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
    

    results in:

    UTF-8
    Ramírez Jones
    Ramirez Jones
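If the XML has already been parsed (for example, the mangled string is already stored in MongoDB), the same repair can be attempted on the decoded string instead. The sketch below is not part of the answer above: `repair_double_encoded` is a hypothetical helper, and it assumes the mbstring extension is available and that the damage is exactly "UTF-8 bytes written out as one code point per byte".

```php
<?php
// Hypothetical helper: undo "UTF-8 bytes written as Latin-1 code points".
function repair_double_encoded(string $text): string
{
    // Map every code point U+0000..U+00FF back to a single byte.
    $bytes = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // If the mapping lost anything (characters above U+00FF), leave the text alone.
    if (mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1') !== $text) {
        return $text;
    }

    // Accept the repaired bytes only if they are themselves valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $text;
}

// "Ram" + U+00C3 + U+00AD + "rez Jones", as SimpleXML returns it.
$mangled  = "Ram\xC3\x83\xC2\xADrez Jones";
$repaired = repair_double_encoded($mangled);
echo bin2hex($repaired), "\n"; // 52616dc3ad72657a204a6f6e6573
```

A string that is already correct, such as a properly encoded Ramírez, passes through unchanged, because converting it to ISO-8859-1 produces a lone 0xED byte that is not valid UTF-8.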