
Translate URLENCODED data into UTF-8 in PHP


I have a string in my database such as 中华武魂. When I post a request to retrieve the data via my website, the data reaches the server in the format %E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82

What decoding steps do I have to take to get it back into a usable form, while also cleaning the user input to ensure they're not going to try an SQL injection attack? (Escape the string before or after decoding?)

EDIT:

 rawurldecode();  // returns "中åŽæ­¦é­‚"
 urldecode();     // returns "中åŽæ­¦é­‚"


public function utf8_urldecode($str) { 
    $str = preg_replace("/%u([0-9a-f]{3,4})/i","&#x\\1;",urldecode($str)); 
    return html_entity_decode($str,null,'UTF-8'); 
}
 // returns "中åŽæ­¦é­‚"

... which actually works when I try and use it in an SQL statement.

I think the output only looked wrong because I was doing an echo and die(); without sending a header that declares UTF-8, so I guess it was being displayed to me as Latin.

Thanks for the help!


Solution

  • When your data is actually in that percent-encoded form, you just have to call rawurldecode:

    $data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
    $str = rawurldecode($data);
    

    This suffices because the data is already encoded in UTF-8: 中 (U+4E2D) is encoded in UTF-8 as the byte sequence 0xE4 0xB8 0xAD, and percent-encoding writes those bytes as %E4%B8%AD.

    If your output does not look as expected, it is probably being interpreted with the wrong character encoding, most likely Windows-1252 instead of UTF-8. In Windows-1252, 0xE4 represents ä, 0xB8 represents ¸, 0xAD is the (normally invisible) soft hyphen, 0xE5 represents å, and so on. So make sure to specify the output character encoding properly.
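
To check both points above, here is a minimal sketch that decodes the string, confirms the resulting bytes are valid UTF-8, and declares the charset before any output. It assumes the script file itself is saved as UTF-8 and that the response is HTML; the variable names `$hex` and `$isUtf8` are only illustrative.

```php
<?php
// Percent-encoded input as it arrives at the server.
$data = '%E4%B8%AD%E5%8D%8E%E6%AD%A6%E9%AD%82';
$str  = rawurldecode($data);

// Declare the output encoding before anything is echoed, so the client
// does not fall back to a single-byte encoding such as Windows-1252.
// (Assumption: the response is HTML; adjust the media type otherwise.)
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Inspect the raw bytes: four characters, three bytes each.
$hex = bin2hex($str);

// An empty pattern with the /u modifier matches only if the subject
// is valid UTF-8, so this works as a validity check.
$isUtf8 = preg_match('//u', $str) === 1;

echo $str, "\n";
```

If `$isUtf8` is true but the page still shows garbled characters, the bytes are fine and the problem is on the display side (missing or wrong charset declaration), which matches what the asker found.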