Tags: php, json, api, unicode, unicode-escapes

How to decode "Unicode" without "\" like u091cu0940u0935u0928


I am using the Facebook API to capture leads.

I receive JSON, which I save to the database as text and use afterwards:

{"created_time":"2020-12-23T04:57:39+0000","id":"1021093571702954","field_data": 
[{"name":"full_name","values":["u091cu0940u0935u0928 u091au094cu0939u093eu0928"]}, 
{"name":"city","values":["delhi"]},{"name":"phone_number","values":["+919911152366"]}, 
{"name":"email","values":["uiabhiu0040gmail.com"]},{"name":"zip_code","values":["110095"]}]}

For the email I found that u0040 represents "@", so I used a string replace in PHP. The problem now is that some names arrive in this format as well, and I am not sure how to decode them.


Solution

  • Using the intl extension (make sure it's installed; enabling it may be as simple as uncommenting it in your php.ini and restarting your server):

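    /**
     * Replaces "uXXXX" sequences (Unicode escapes that lost their backslash) with the actual characters.
     */
    function translateUnescapedUnicode(string $subject): string
    {
        return preg_replace_callback('/u([0-9a-fA-F]{4})/', function ($match) {
            // $match[1] holds the four hex digits: convert them to a code point, then to a character.
            return IntlChar::chr(hexdec($match[1]));
        }, $subject);
    }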
    

    What's happening here:

    1. We're capturing the unescaped Unicode sequences (u followed by 4 hex digits)
    2. $match[0] holds the full match (uXXXX), while $match[1] holds only our capturing group (([0-9a-fA-F]{4})) - the hex digits
    3. We're converting the hex value to a decimal value using hexdec
    4. We're feeding the decimal value to IntlChar::chr, which, per the documentation:

    Returns a string containing the character specified by the Unicode code point value.
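
    The first three steps can be traced on the email value from the question using only core PHP. This is an illustrative sketch: the variable names are made up here, and chr() stands in for the last step only because U+0040 fits in a single byte. That is not the case for the Devanagari code points, which is why the function itself relies on IntlChar::chr.

    ```php
    <?php
    // Illustrative only: variable names are not from the original answer.
    $value = 'uiabhiu0040gmail.com';

    // Step 1: find every "u" followed by exactly four hex digits.
    preg_match_all('/u([0-9a-fA-F]{4})/', $value, $matches);

    // Step 2: index 0 holds the full matches, index 1 the captured hex digits.
    $fullMatches = $matches[0]; // ['u0040']
    $hexDigits   = $matches[1]; // ['0040']

    // Step 3: convert the hex digits to a decimal code point.
    $codePoint = hexdec($hexDigits[0]); // 64

    // U+0040 is plain ASCII, so the single-byte chr() is enough for this one character.
    echo chr($codePoint), PHP_EOL; // @
    ```

    Note that the leading "ui" of the address is not touched, because "i" is not a hex digit.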

    Testing it on your JSON:

    $decodedJson = json_decode($json, true);
    foreach ($decodedJson['field_data'] as $fieldData) {
        var_dump(translateUnescapedUnicode($fieldData['values'][0]));
    }
    

    will produce the following:

    string(28) "जीवन चौहान"
    string(5) "delhi"
    string(13) "+919911152366"
    string(16) "uiabhi@gmail.com"
    string(6) "110095"
    

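    As you can see, regular strings without any Unicode sequences pass through unchanged. Keep in mind that the pattern cannot tell a stripped escape from ordinary text, so a plain value that happens to contain u followed by four hex digits (for example the "ubeda" in "Subedar") would be converted as well.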
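
    If the intl extension is not available, a similar result can be sketched with json_decode, since "u" plus four hex digits is a JSON \uXXXX escape that lost its backslash. The function name translateViaJsonDecode and the shortened sample $json are assumptions for illustration, not part of the original answer. Runs of consecutive sequences are decoded together so that surrogate pairs survive, and anything json_decode rejects is left as it was.

    ```php
    <?php
    // Hypothetical alternative: the function name is an assumption, not from the original answer.
    function translateViaJsonDecode(string $subject): string
    {
        // Match runs of consecutive uXXXX sequences so surrogate pairs stay together.
        return preg_replace_callback('/(?:u[0-9a-fA-F]{4})+/', function (array $match): string {
            // Hex digits never contain "u", so this only restores the missing backslashes.
            $jsonLiteral = '"' . str_replace('u', '\\u', $match[0]) . '"';
            $decoded = json_decode($jsonLiteral);
            // json_decode yields null for invalid escapes; keep the original text in that case.
            return is_string($decoded) ? $decoded : $match[0];
        }, $subject);
    }

    // Sample input shortened from the question's JSON.
    $json = '{"field_data":[{"name":"full_name","values":["u091cu0940u0935u0928"]},'
          . '{"name":"email","values":["uiabhiu0040gmail.com"]}]}';

    $lead = [];
    foreach (json_decode($json, true)['field_data'] as $field) {
        // Decode every value of the field, not only the first one.
        $lead[$field['name']] = array_map('translateViaJsonDecode', $field['values']);
    }

    var_dump($lead['email'][0]); // string(16) "uiabhi@gmail.com"
    ```

    The same caveat about ordinary text that looks like an escape applies to this variant, since it uses the same pattern.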