php unicode simplexml

simplexml_load_string - parse error due to unicode characters in payload


I have a problem with simplexml_load_string failing with parse errors on an XML payload, sourced from a database, that contains Unicode characters.

I'm at a loss as to how to get PHP to read this so I can use the XML as I normally would. The code worked fine until people started getting creative with the data they submit.

Unfortunately I cannot modify the source data; I have to work with what I receive. To give you an idea, one field that breaks it looks like this in the original raw receipt:

<FirstName>🐺</FirstName>

Previously the code worked fine, parsing the XML with one simple line:

$xmlresult = simplexml_load_string($result, 'SimpleXMLElement',LIBXML_NOCDATA);

However, with these Unicode characters it just errors. Depending on what I use to view the data, a dump of the raw payload can look like:

 <d83d><dc3a>

or <U+D83D><U+DC3A>
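
D83D and DC3A are the UTF-16 surrogate pair for U+1F43A (🐺). One possible reading of these dumps is that each half arrives as its own three-byte sequence, which is not valid UTF-8, and surrogate code points are not legal XML characters. The sketch below works only under that assumption, which the post does not confirm; the helper name `repair_surrogate_pairs` is made up for illustration.

```php
<?php
// Hypothetical helper. ASSUMPTION: the payload contains surrogate halves that
// were each encoded as a separate 3-byte sequence (ED A0..AF xx, ED B0..BF xx).
// Each such pair is rewritten as the single 4-byte UTF-8 character it stands for.
function repair_surrogate_pairs(string $bytes): string
{
    $fixed = preg_replace_callback(
        // No /u modifier: the pattern has to match raw bytes.
        '/\xED([\xA0-\xAF])([\x80-\xBF])\xED([\xB0-\xBF])([\x80-\xBF])/',
        static function (array $m): string {
            // Decode the two 3-byte sequences back to the surrogate values.
            $high = 0xD000 | ((ord($m[1]) & 0x3F) << 6) | (ord($m[2]) & 0x3F);
            $low  = 0xD000 | ((ord($m[3]) & 0x3F) << 6) | (ord($m[4]) & 0x3F);
            // Combine the pair into one code point above U+FFFF.
            $cp = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
            // Emit that code point as 4-byte UTF-8.
            return chr(0xF0 | ($cp >> 18))
                 . chr(0x80 | (($cp >> 12) & 0x3F))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        },
        $bytes
    );
    // preg_replace_callback() returns null on failure; fall back to the input.
    return $fixed ?? $bytes;
}

// Stand-in for the failing field, with the wolf stored as two surrogate halves.
$broken = "<FirstName>\xED\xA0\xBD\xED\xB0\xBA</FirstName>";
$xml = simplexml_load_string(repair_surrogate_pairs($broken));
echo bin2hex((string) $xml), "\n"; // f09f90ba, the UTF-8 bytes of U+1F43A
```

If the raw bytes turn out to be something else, this function leaves them untouched, so it is safe to try but may simply not apply.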

Reading a bit on Stack Overflow, it seemed DOM might work, but I didn't have any luck there either.
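
SimpleXML and DOM both parse through libxml, so both fail on the same input, and libxml can report exactly what it objects to. A minimal sketch for collecting those messages follows; the `$result` string here is a made-up stand-in for the real payload, with a byte that is never valid UTF-8.

```php
<?php
// Collect libxml's messages instead of letting them surface as PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical stand-in for the real payload: a lone 0xFF byte is never valid UTF-8.
$result = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><FirstName>\xFF</FirstName>";

$xmlresult = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_NOCDATA);

$messages = [];
if ($xmlresult === false) {
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError carrying line, column, code and message.
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    // Empty the buffer so later parses start clean.
    libxml_clear_errors();
}

print_r($messages);
```

The line and column in each message point at the offending bytes, which helps tell an encoding problem apart from malformed markup.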

The incoming payload does have the header:

<?xml version="1.0" encoding="UTF-8"?>

The data comes in via:

<data type="cdata"><![CDATA[<payload>
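
Because the payload sits inside a CDATA section, it takes two parses: one for the envelope, then one for the text of the data element. The sketch below shows that flow. The `body->function->data` path comes from the parsing line quoted elsewhere in this thread; the `response` root element and the sample values are assumptions made up for illustration.

```php
<?php
// Hypothetical envelope: only the data element and the body->function->data
// path come from the thread; the root name and sample content are made up.
$result = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<response><body><function><data type="cdata"><![CDATA[<payload><FirstName>\u{1F43A}</FirstName></payload>]]></data></function></body></response>
XML;

// First parse: the envelope. LIBXML_NOCDATA merges CDATA sections into text nodes.
$xmlresult = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_NOCDATA);
if ($xmlresult === false) {
    throw new RuntimeException('Envelope did not parse');
}

// The CDATA content is just a string at this point, so cast it and parse again.
$inner = (string) $xmlresult->body->function->data;
$xmldata = simplexml_load_string($inner);
if ($xmldata === false) {
    throw new RuntimeException('Inner payload did not parse');
}

echo $xmldata->FirstName, "\n";
```

Checking each parse separately shows whether the envelope or the embedded payload is the part that fails.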

I'm at a complete loss; hopefully I can get some help here to get me over this hump with this data handling.


Solution

  • I've been staring at this for days, and it seems the one thing I hadn't tried was wrapping my curl call in utf8_encode, like this:

        $result = utf8_encode(do_curl($xmlbuildquery));
    

    My do_curl function is just a separate function that runs the curl request, nothing more. With that in place I'm able to parse the results, and instead of those Unicode escapes showing up, the value displays as

    [firstname] => 🐺
    

    The above is the result of print_r($result); after running:
    $xmldata = simplexml_load_string((string)$xmlresult->body->function->data);

    With that in place the XML finally parses. Oddly, this sparked my curiosity further: the same information is also provided via a CSV that's imported into a MySQL database, and when I look up the same record there it's shown as:

     FirstName: ????
    

    with the column defined as: FirstName varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,

    That might suggest they're not UTF-8 encoding the output to the CSV; separate from this issue, but interesting.

    And finally, my script is able to run again!!
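
One caveat on the fix above: utf8_encode() treats its input as ISO-8859-1 and is deprecated as of PHP 8.2. A rough sketch of the same conversion using mbstring follows, with a guard so that input which is already valid UTF-8 is not encoded a second time. The helper name `to_utf8` is made up, and the sample string is an assumption chosen to show the Latin-1 case.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function to_utf8(string $raw): string
{
    // Already valid UTF-8: converting again would double-encode it.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Same mapping utf8_encode() performs: read the bytes as ISO-8859-1.
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Ren\xE9";                  // "René" as ISO-8859-1 bytes
echo bin2hex(to_utf8($latin1)), "\n"; // 52656ec3a9

$already = "Ren\u{E9}";               // the same name, already UTF-8
echo bin2hex(to_utf8($already)), "\n"; // 52656ec3a9, left unchanged
```

In the original call this would replace the wrapper, for example `$result = to_utf8(do_curl($xmlbuildquery));`. Whether that reproduces the behaviour seen here depends on what bytes the remote service actually sends.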