
strftime(): Encoding is wrong for Chinese, Russian and Hungarian


What I am trying to do is rather simple: I want to print a date (timestamp) in Chinese (or Russian).

For all languages I use the same pattern, shown here for Hungarian:

setlocale(LC_TIME, 'hu_HU.utf8', 'hu_HU.UTF-8', 'hu_HU', 'hu'); // 'hu' = Hungarian ('hr' is Croatian)
$date = strftime('%a %e %b %Y, %H:%M');

$date = utf8_encode($date);

This returns a UTF-8 string even without the utf8_encode() call, so everything is fine. But when I do exactly the same with the 'zh_CN.utf8' locale (or 'zh_CN.UTF-8', 'zh_CN' or 'zh'), it does not return the correct date. With or without utf8_encode() it returns

'2018å¹?mæ?#dæ?'

I don't speak Chinese, but this is obviously wrong. I found out that it should contain something like '年'. This character is encoded in UTF-8 as the bytes E5 B9 B4, but the returned string contains different bytes: after 2018 it reads C3 A5 C2 B9 3F 6D C3 A6 ....
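The byte pattern above looks like double encoding: UTF-8 bytes that were treated as ISO-8859-1 and encoded to UTF-8 a second time, which is what utf8_encode() does. A minimal sketch that reproduces the pattern, assuming the mbstring extension is loaded (the variable names are only illustrative):

```php
<?php
// UTF-8 bytes of '年', written as escapes so the file encoding does not matter
$nian = "\xE5\xB9\xB4";
echo bin2hex($nian), "\n"; // e5b9b4

// Treat those bytes as ISO-8859-1 and convert them to UTF-8 again
// (the same conversion utf8_encode() performs): every non-ASCII byte doubles
$double_encoded = mb_convert_encoding($nian, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double_encoded), "\n"; // c3a5c2b9c2b4
```

The first four bytes, C3 A5 C2 B9, are the ones that appear after 2018 in the returned string.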

When I check the encoding of the returned string with mb_detect_encoding(), it always reports UTF-8. I expected that, because I am using the 'zh_CN.utf8' locale, which should set the encoding to UTF-8.
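A UTF-8 verdict only says that the bytes form structurally valid UTF-8, not that the text is correct. A small sketch of the difference, assuming mbstring is loaded; both byte strings are sample values chosen for illustration:

```php
<?php
// double-encoded '年': unreadable for a human, but structurally valid UTF-8
$double_encoded = "\xC3\xA5\xC2\xB9\xC2\xB4";
// '十二月' in a legacy Chinese encoding (CP936): not valid UTF-8
$legacy_bytes = "\xCA\xAE\xB6\xFE\xD4\xC2";

var_dump(mb_check_encoding($double_encoded, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($legacy_bytes, 'UTF-8'));   // bool(false)
```

So a string can pass as UTF-8 and still be mojibake; comparing the actual bytes with the expected ones is the more reliable check.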

After looking around for quite some time I came across this answer by Peter. He suggests using the format '%Y年%m月%e日' in strftime(). When I use it I get the same result as before.

This makes me think that the encoding is wrong. But is that true? And if so, how do I convert the result to the correct encoding?

I have more or less the same problem with Russian.


Solution

  • The solution

    I spent several hours on this and found the correct encodings. strftime() does not deliver a UTF-8 string here. For details have a look at the bottom of this answer. I ended up with a formatTime() function that returns the time in the requested language and in the correct encoding (UTF-8 for me).

    function formatTime($format, $language = null, $timestamp = null){
        // remember the active locale so it can be restored at the end
        // (setlocale() returns the NEW locale, so its return value cannot be used for that)
        $previous_locale = setlocale(LC_TIME, 0);

        switch($language){
            case 'chinese':
                setlocale(LC_TIME, 'zh_CN.utf8', 'zh_CN.UTF-8', 'zh_CN', 'zh');
                break;
            case 'hungarian':
                // 'hu' is the Hungarian language code ('hr' would select Croatian)
                setlocale(LC_TIME, 'hu_HU.utf8', 'hu_HU.UTF-8', 'hu_HU', 'hu');
                break;
            case 'russian':
                setlocale(LC_TIME, 'ru_RU.utf8', 'ru_RU.UTF-8', 'ru_RU', 'ru');
                break;
            case 'german':
                setlocale(LC_TIME, 'de_DE.utf8', 'de_DE.UTF-8', 'de_DE', 'de');
                break;
            case 'french':
                setlocale(LC_TIME, 'fr_FR.utf8', 'fr_FR.UTF-8', 'fr_FR', 'fr');
                break;
            case 'polish':
                setlocale(LC_TIME, 'pl_PL.utf8', 'pl_PL.UTF-8', 'pl_PL', 'pl');
                break;
            case 'turkish':
                setlocale(LC_TIME, 'tr_TR.utf8', 'tr_TR.UTF-8', 'tr_TR', 'tr');
                break;
            case 'english':
                setlocale(LC_TIME, 'en_GB.utf8', 'en_GB.UTF-8', 'en_GB', 'en');
                break;
            // ...
            default: break; // unknown language: keep the current locale
        }

        if(!is_numeric($timestamp)){
            $datetime = strftime($format);
        }
        else{
            $datetime = strftime($format, $timestamp);
        }

        $current_locale = strtolower(setlocale(LC_TIME, 0));

        // strpos() expects the haystack first and the needle second
        if(($pos = strpos($current_locale, "utf")) === false || strpos($current_locale, "8", $pos) === false){
            // UTF-8 locale is not used, the encodings are found out with the code shown below
            $locale_default_encodings = array(
                "german" => "ISO-8859-1",
                "french" => "ISO-8859-1",
                "polish" => "ISO-8859-2",
                "turkish" => "ISO-8859-9",
                // The test below reports "Windows-1252" for the 'hr' (Croatian) fallback 
                // locale. php.net recommends ISO-8859-2 for Hungarian (*). Note that 
                // Windows-1252 extends ISO-8859-1, not ISO-8859-2, so verify this value 
                // on your own system
                "hungarian" => "ISO-8859-2", 
                "chinese" => "CP936",
                // the bytes c4e5eae0e1f0fc returned for December are "Декабрь" in 
                // Windows-1251; decoded as KOI8-R they become the mojibake "дЕЙЮАПЭ"
                "russian" => "Windows-1251"
            );
            $target_encoding = mb_internal_encoding(); // or "UTF-8" or whatever

            if(isset($locale_default_encodings[$language])){
                $datetime = mb_convert_encoding(
                    $datetime, 
                    $target_encoding, 
                    $locale_default_encodings[$language]
                );
            }
            else{
                // try to avoid this case: the source encoding is only guessed
                $datetime = mb_convert_encoding($datetime, $target_encoding);
            }
        }

        // restore the locale that was active before this call
        setlocale(LC_TIME, $previous_locale);

        return $datetime;
    }
    

    (*): http://php.net/manual/de/function.strftime.php#94399
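The core of the function is the mb_convert_encoding() call. The following is a self-contained sketch that uses the raw bytes recorded in the test output further down (the December month names for the Chinese and Russian locales), so it runs without any locale installed. The variable names are only for illustration, and for the Russian bytes Windows-1251 is assumed because it is the encoding that yields the readable word Декабрь:

```php
<?php
// raw strftime('%B') results for December, copied from the hex dumps below
$chinese_raw = hex2bin('caaeb6fed4c2');   // CP936 bytes
$russian_raw = hex2bin('c4e5eae0e1f0fc'); // Windows-1251 bytes

// convert from the locale's legacy encoding to UTF-8
$chinese = mb_convert_encoding($chinese_raw, 'UTF-8', 'CP936');
$russian = mb_convert_encoding($russian_raw, 'UTF-8', 'Windows-1251');

echo bin2hex($chinese), "\n"; // e58d81e4ba8ce69c88 (十二月)
echo bin2hex($russian), "\n"; // d094d0b5d0bad0b0d0b1d180d18c (Декабрь)
```

Working on fixed byte strings like this makes it easy to check an encoding choice before wiring it into formatTime().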

    The long long way

    I checked the strftime("%B") result for each language. This is the full month name. I looked up the translation of the month in each of my languages, then looked up the UTF-8 hex values for the letters of that translation.

    Then I iterate through all encodings that PHP supports and convert the result of strftime() from the currently iterated encoding to UTF-8. I compare the hex values of the converted result with the hex values of the manual translation, which are UTF-8 as well. If they match, the result of strftime() is in the currently iterated encoding.

    I chose hex values because they are definitely comparable and do not depend on the internal encoding: they are plain ASCII strings (or even numbers in PHP).

    This gives me the following output, the code is posted below:

    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        </head>
        <body>
            <h1>Detecting the font encoding of <code>strftime()</code>
            </h1>
            <h2>hungarian</h2>
            <p>
                <code>strftime()</code> for March for language hungarian. Expected hex:  <code>6fc5be756a616b</code>, converted expected hex to string: <code>ožujak</code>
            </p>
            <table>
                <tr>
                    <td>initial return value</td>
                    <td>oߵjak</td>
                    <td>6f9e756a616b</td>
                </tr>
    
                <tr>
                    <td colspan='3'>Encodings that deliver the correct result:</td>
                </tr>
                <tr style='background: green;'>
                    <td>Windows-1252</td>
                    <td>ožujak</td>
                    <td>6fc5be756a616b</td>
                </tr>
            </table>
            <h2>chinese</h2>
            <p>
                <code>strftime()</code> for December for language chinese. Expected hex:  <code>e58d81e4ba8ce69c88</code>, converted expected hex to string: <code>十二月</code>
            </p>
            <table>
                <tr>
                    <td>initial return value</td>
                    <td>ʮ׾Ղ</td>
                    <td>caaeb6fed4c2</td>
                </tr>
    
                <tr>
                    <td colspan='3'>Encodings that deliver the correct result:</td>
                </tr>
                <tr style='background: green;'>
                    <td>EUC-CN</td>
                    <td>十二月</td>
                    <td>e58d81e4ba8ce69c88</td>
                </tr>
                <tr style='background: green;'>
                    <td>CP936</td>
                    <td>十二月</td>
                    <td>e58d81e4ba8ce69c88</td>
                </tr>
                <tr style='background: green;'>
                    <td>GB18030</td>
                    <td>十二月</td>
                    <td>e58d81e4ba8ce69c88</td>
                </tr>
            </table>
            <h2>russian</h2>
            <p>
                <code>strftime()</code> for December for language russian. Expected hex:  <code>d0b4d095d099d0aed090d09fd0ad</code>, converted expected hex to string: <code>дЕЙЮАПЭ</code>
            </p>
            <table>
                <tr>
                    <td>initial return value</td>
                    <td>ť롡𼼯</td>
                    <td>c4e5eae0e1f0fc</td>
                </tr>
    
                <tr>
                    <td colspan='3'>Encodings that deliver the correct result:</td>
                </tr>
                <tr style='background: green;'>
                    <td>KOI8-R</td>
                    <td>дЕЙЮАПЭ</td>
                    <td>d0b4d095d099d0aed090d09fd0ad</td>
                </tr>
                <tr style='background: green;'>
                    <td>KOI8-U</td>
                    <td>дЕЙЮАПЭ</td>
                    <td>d0b4d095d099d0aed090d09fd0ad</td>
                </tr>
            </table>
        </body>
    </html>

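    Note that this HTML is encoded in UTF-8. Still, the result given by the strftime() function is wrong! This has nothing to do with the browser or editor encoding, as pointed out in the comments.

    Two caveats about the rows above. The Russian reference string дЕЙЮАПЭ is itself mojibake: the returned bytes c4 e5 ea e0 e1 f0 fc spell Декабрь (December) in Windows-1251, and decoding them as KOI8-R produces exactly дЕЙЮАПЭ. The KOI8-R match therefore only reflects a wrong reference value, and the encoding to convert from is Windows-1251. Likewise, ožujak is the Croatian word for March: 'hr' in the locale list is the Croatian language code, so the "hungarian" row actually tests a Croatian locale.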

    $encodings = array(
        "UCS-4",
        "UCS-4BE",
        "UCS-4LE",
        "UCS-2",
        "UCS-2BE",
        "UCS-2LE",
        "UTF-32",
        "UTF-32BE",
        "UTF-32LE",
        "UTF-16",
        "UTF-16BE",
        "UTF-16LE",
        "UTF-7",
        "UTF7-IMAP",
        "UTF-8",
        "ASCII",
        "EUC-JP",
        "SJIS",
        "eucJP-win",
        "SJIS-win",
        "ISO-2022-JP",
        "ISO-2022-JP-MS",
        "CP932",
        "CP51932",
        "SJIS-mac",
        "SJIS-Mobile#DOCOMO",
        "SJIS-Mobile#KDDI",
        "SJIS-Mobile#SOFTBANK",
        "UTF-8-Mobile#DOCOMO",
        "UTF-8-Mobile#KDDI-A",
        "UTF-8-Mobile#KDDI-B",
        "UTF-8-Mobile#SOFTBANK",
        "ISO-2022-JP-MOBILE#KDDI",
        "JIS",
        "JIS-ms",
        "CP50220",
        "CP50220raw",
        "CP50221",
        "CP50222",
        "ISO-8859-1",
        "ISO-8859-2",
        "ISO-8859-3",
        "ISO-8859-4",
        "ISO-8859-5",
        "ISO-8859-6",
        "ISO-8859-7",
        "ISO-8859-8",
        "ISO-8859-9",
        "ISO-8859-10",
        "ISO-8859-13",
        "ISO-8859-14",
        "ISO-8859-15",
        "ISO-8859-16",
        "byte2be",
        "byte2le",
        "byte4be",
        "byte4le",
        "BASE64",
        "HTML-ENTITIES",
        "7bit",
        "8bit",
        "EUC-CN",
        "CP936",
        "GB18030",
        "HZ",
        "EUC-TW",
        "CP950",
        "BIG-5",
        "EUC-KR",
        "UHC",
        "ISO-2022-KR",
        "Windows-1251",
        "Windows-1252",
        "CP866",
        "KOI8-R",
        "KOI8-U",
        "ArmSCII-8"
    );
    
    $show_wrong_encodings = false;
    $internal_encoding = "UTF-8";
    mb_internal_encoding($internal_encoding);
    
    $languages = array(
        // name of the language => hex in UTF-8 and timestamp to check
        "german" => array("4dc3a4727a", 1520343439), // march
        "french" => array("64c3a963656d627265", 1544103703), // december
        "polish" => array("677275647a6965c584", 1544103703), // december
        "turkish" => array("4172616cc4b16b", 1544103703), // december
        // "ožujak" is Croatian: the 'hr' fallback in formatTime() below selects a Croatian locale
        "hungarian" => array("6fc5be756a616b", 1520343439), // march
        "chinese" => array("e58d81e4ba8ce69c88", 1544103703), // december
        // "Декабрь"; the output shown above was produced with d0b4d095d099d0aed090d09fd0ad, 
        // which is "дЕЙЮАПЭ", i.e. the Windows-1251 bytes misread as KOI8-R
        "russian" => array("d094d0b5d0bad0b0d0b1d180d18c", 1544103703) // december
    );
    
    $format = "%B"; // print full month name
    print("<h1>Detecting the font encoding of <code>strftime()</code></h1>\n");
    
    foreach($languages as $language => $data){
        // the hex value in UTF-8, this is the target value
        $hex = $data[0];
        // the timestamp to check
        $timestamp = $data[1];
    
        print(
            "<h2>".$language."</h2>\n".
            "<p>".
                "<code>strftime()</code> for ".formatTime("%B", "english", $timestamp)." ".
                "for language ".$language.". Expected hex:  <code>".$hex."</code>, converted expected ".
                "hex to string: <code>".tostring($hex)."</code>".
            "</p>\n"
        );
    
        // this is a different formatTime() function than mentioned above, it is defined after this 
        // foreach
        $string = formatTime("%B", $language, $timestamp);
    
        print("<table>\n");
        print("<tr>\n".
                "\t<td>initial return value</td>\n".
                "\t<td>".$string."</td>\n".
                "\t<td>".tohex($string)."</td>\n".
            "</tr>\n\n".
            "<tr><td colspan='3'>Encodings that deliver the correct result:</td></tr>"
        );
    
        foreach($encodings as $source_encoding){
            $converted = mb_convert_encoding($string, $internal_encoding, $source_encoding);
            $converted_hex = tohex($converted);
    
            $style = "";
            if($converted_hex == $hex){
                $style = "background: green";
            }
            elseif(!$show_wrong_encodings){
                $style = "display: none";
            }
    
            print("<tr style='".$style.";'>\n".
                    "\t<td>".$source_encoding."</td>\n".
                    "\t<td>".$converted."</td>\n".
                    "\t<td>".$converted_hex."</td>\n".
                "</tr>\n"
            );
        }
        print("</table>");
    }
    
    function tohex($string){
        return implode(unpack("H*", $string));
    }
    
    function tostring($hex){
        return pack("H*", $hex);
    }
    
    function formatTime($format, $language, $timestamp){
        // remember the active locale so it can be restored after formatting
        // (setlocale() returns the NEW locale, not the previous one)
        $previous_locale = setlocale(LC_TIME, 0);

        switch($language){
            case 'chinese':
                setlocale(LC_TIME, 'zh_CN.utf8', 'zh_CN.UTF-8', 'zh_CN', 'zh');
                break;
            case 'hungarian':
                // note: 'hr' is the Croatian language code; it stays here because the 
                // expected value for "hungarian" above ("ožujak") was recorded with it
                setlocale(LC_TIME, 'hu_HU.utf8', 'hu_HU.UTF-8', 'hu_HU', 'hr');
                break;
            case 'russian':
                setlocale(LC_TIME, 'ru_RU.utf8', 'ru_RU.UTF-8', 'ru_RU', 'ru');
                break;
            case 'german':
                setlocale(LC_TIME, 'de_DE.utf8', 'de_DE.UTF-8', 'de_DE', 'de');
                break;
            case 'french':
                setlocale(LC_TIME, 'fr_FR.utf8', 'fr_FR.UTF-8', 'fr_FR', 'fr');
                break;
            case 'polish':
                setlocale(LC_TIME, 'pl_PL.utf8', 'pl_PL.UTF-8', 'pl_PL', 'pl');
                break;
            case 'turkish':
                setlocale(LC_TIME, 'tr_TR.utf8', 'tr_TR.UTF-8', 'tr_TR', 'tr');
                break;
            // ...
            default:
                setlocale(LC_TIME, 'en_GB.utf8', 'en_GB.UTF-8', 'en_GB', 'en');
                break;
        }

        // no conversion here: the raw bytes are what the test wants to inspect
        $datetime = strftime($format, $timestamp);
        setlocale(LC_TIME, $previous_locale);

        return $datetime;
    }
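
The comparison at the heart of the script can also be condensed into a small helper that works on fixed byte strings, which makes it usable without installed locales. This is only a sketch: matchingEncodings() and the short candidate list are hypothetical names and values chosen for illustration, and the raw bytes are the ones recorded in the output above.

```php
<?php
/**
 * Returns every candidate encoding that turns $raw into the expected UTF-8 bytes.
 * (hypothetical helper, not part of the script above)
 */
function matchingEncodings(string $raw, string $expected_hex, array $candidates): array {
    $matches = array();
    foreach ($candidates as $encoding) {
        // interpret $raw as $encoding and convert it to UTF-8
        $converted = mb_convert_encoding($raw, 'UTF-8', $encoding);
        if (bin2hex($converted) === strtolower($expected_hex)) {
            $matches[] = $encoding;
        }
    }
    return $matches;
}

$candidates = array('ISO-8859-1', 'Windows-1251', 'Windows-1252', 'KOI8-R', 'CP936');

// raw month name for the Chinese locale, expected UTF-8 hex of '十二月'
$chinese = matchingEncodings(hex2bin('caaeb6fed4c2'), 'e58d81e4ba8ce69c88', $candidates);
// raw month name for the Russian locale, expected UTF-8 hex of 'Декабрь'
$russian = matchingEncodings(hex2bin('c4e5eae0e1f0fc'), 'd094d0b5d0bad0b0d0b1d180d18c', $candidates);

print_r($chinese);
print_r($russian);
```

With these inputs the helper should report CP936 for the Chinese bytes and Windows-1251 for the Russian ones. The reference hex has to come from a trustworthy spelling of the word, otherwise the match is meaningless.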