php json character-encoding sanitization

Strange characters in (invalid) json string from post request (encoding issues)


I am trying to read the body of a POST request using the following line:

$data = file_get_contents('php://input');

The JSON string might look like this: {"test" : "test one \xe0 "}

The problem is that json_decode($data) returns null. When I var_dump() $data, I see sequences such as \xe0 and \xe7a.

The data is sent as UTF-8. I also tried utf8_decode($data), but with no luck. Could someone explain what I am missing or how to solve this issue?

I need to convert the invalid JSON from:

$data = '{"test" : "test one \xe0 "}';

to:

$data = '{"test" : "test one à "}';
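
For context, the null result described above can be confirmed as a parse failure rather than an encoding failure. In JSON, `\x` is not a valid escape inside a string, and here the payload contains the literal characters backslash, `x`, `e`, `0` (plain ASCII), which is why utf8_decode() has nothing to convert. A minimal sketch, assuming the payload arrives exactly as shown in the question:

```php
<?php
// Single quotes keep \xe0 as four literal characters, as in the question.
$data = '{"test" : "test one \xe0 "}';

$decoded = json_decode($data);

// json_decode() signals failure by returning null; the reason is
// available through json_last_error() / json_last_error_msg().
var_dump($decoded);
echo json_last_error_msg(), "\n";
```

Passing the JSON_THROW_ON_ERROR flag (PHP 7.3+) to json_decode() is an alternative that raises a JsonException instead of returning null.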

Solution

  • Editing a JSON string with string functions should always be done with caution, because a false-positive replacement can easily damage the payload. That said, here is a script that attempts to correct your invalid JSON string.

    Code: (Demo)

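    // Single quotes keep the \x sequences as literal characters.
    $json = '{"test" : "test one \xe0, \x270B"}';

    // Rewrite each \x<hex> escape as a JSON \u escape,
    // left-padding the hex digits with zeros to four characters.
    $json = preg_replace_callback(
        '/\\\\x([[:xdigit:]]+)/',
        fn($m) => sprintf('\u%04s', $m[1]),
        $json
    );

    // json_validate() requires PHP 8.3 or later.
    echo "\n" . var_export(json_validate($json), true);
    echo "\n$json\n";
    var_export(json_decode($json));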
    

    Output:

    true
    {"test" : "test one \u00e0, \u270B"}
    (object) array(
       'test' => 'test one à, ✋',
    )
    

    If this has known flaws, please leave a comment below and I'll endeavor to overcome the issue when I have time.

    A related answer of mine: Replace all hex sequences with ascii characters
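
One flaw worth noting in the pattern above: the greedy `[[:xdigit:]]+` reads the question's `\xe7a` as a single escape (`\u0e7a`) rather than `ç` followed by `a`. Below is a minimal sketch of a stricter variant, assuming every `\x` escape is exactly two hex digits naming a Latin-1 code point. Under that assumption it no longer handles longer escapes such as `\x270B`. The function name `fixHexEscapes` is hypothetical and not part of the original answer. The pattern also leaves `\x` alone when the backslash is itself escaped (`\\x`), since that is already valid JSON.

```php
<?php
// Hypothetical helper: rewrites \xHH (exactly two hex digits) as \u00HH.
function fixHexEscapes(string $json): string
{
    return preg_replace_callback(
        // (?<!\\)      the run of backslashes starts here
        // ((?:\\\\)*)  an even number of backslashes (escaped pairs), kept as-is
        // \\x(..)      an unescaped backslash, "x", then two hex digits
        '/(?<!\\\\)((?:\\\\\\\\)*)\\\\x([[:xdigit:]]{2})/',
        fn(array $m): string => $m[1] . '\u00' . $m[2],
        $json
    );
}

$fixed = fixHexEscapes('{"test" : "test one \xe0 \xe7a"}');
echo $fixed, "\n";
var_export(json_decode($fixed));
```

Which variant fits depends on what the sender actually emits. If the escapes can be either two or four digits, the input is ambiguous and no regex can recover the intent reliably.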