character-encoding php-8.1 php-8.2

mb_convert_encoding works incorrectly with Japanese characters


On PHP 8, mb_detect_encoding works incorrectly. For example, I have the code below:

$str = "苫小牧"; // assume that $str is encoded as SJIS
// detect the encoding of $str
$encode = mb_detect_encoding($str);
// convert 苫小牧 to EUC-JP
$euc_str = mb_convert_encoding($str, "EUC-JP", $encode);

In this case, mb_convert_encoding returns garbled characters. The reason is that mb_detect_encoding detects the encoding incorrectly. Can anybody suggest a solution for this case? I think I need to write another function to detect the encoding instead of using mb_detect_encoding.


Solution

  • As the manual page for mb_detect_encoding says:

    Automatic detection of the intended character encoding can never be entirely reliable; without some additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of character encoding stored or transmitted with the data, such as a "Content-Type" HTTP header.

    Part of the way the function combats this is to require you to provide a list of candidate encodings. If none is provided directly to the function, they are taken from a global configuration state (see mb_detect_order).

    For instance, taking the string you've provided, and using $encode = mb_detect_encoding($sjis_str, 'EUC-JP,UTF-8,SJIS'); returns 'SJIS', and the conversion appears to proceed correctly, as demonstrated here: https://3v4l.org/KKf2h

    The shorter the input, and the more candidate encodings you list, the more likely it is that mb_detect_encoding will guess wrong - the string may be equally valid in multiple encodings.

    It's also worth noting that if the string is not valid in any of the encodings you list, mb_detect_encoding will return false, so if you are using it to process unknown strings, you should check if ( $encode === false ) and add some appropriate error handling.
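
Putting those points together, here is a minimal sketch of a wrapper that detects against an explicit candidate list and fails loudly when nothing matches. The function name convert_with_detection is made up for illustration, and passing true as the third argument (strict mode) is my own addition rather than something the question used. The sample also assumes the source file itself is saved as UTF-8, so the SJIS test bytes are built by converting the literal first.

```php
<?php
// Hypothetical helper (not part of mbstring): detect the encoding from an
// explicit candidate list, then convert, throwing if detection fails.
function convert_with_detection(string $input, string $to, array $candidates): string
{
    // Third argument enables strict mode, so candidates in which $input
    // is not valid are rejected instead of being guessed at.
    $from = mb_detect_encoding($input, $candidates, true);

    if ($from === false) {
        // The input was not valid in any of the listed encodings.
        throw new RuntimeException('Input is not valid in any candidate encoding');
    }

    return mb_convert_encoding($input, $to, $from);
}

// Assumption: this file is UTF-8, so build real SJIS bytes from the literal.
$sjis_str = mb_convert_encoding('苫小牧', 'SJIS', 'UTF-8');

// Same candidate list as in the answer above.
$euc_str = convert_with_detection($sjis_str, 'EUC-JP', ['EUC-JP', 'UTF-8', 'SJIS']);

// Round-trip check: convert the EUC-JP result back to UTF-8 and compare
// it with the original literal.
var_dump(mb_convert_encoding($euc_str, 'UTF-8', 'EUC-JP') === '苫小牧');
```

If you already know where the data comes from (a database column, an HTTP header, a file format), it is still better to skip detection entirely and pass the known encoding straight to mb_convert_encoding.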