php mysql encoding utf-8 windows-1251

Convert from MySQL cp1251_general_ci collation (Windows-1251) to UTF-8 in PHP


I have a MySQL varchar(50) column in the cp1251_general_ci collation. After mysql_fetch_row in PHP I get a $string. Then I do the following:

echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // echoes Windows-1251
$string = mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // again echoes Windows-1251

Why is the string not detected as UTF-8 the second time?

I also tried

$string = iconv('Windows-1251', 'UTF-8', $string);

But again the output charset is reported as Windows-1251.

In the end I get a broken encoding in my file name, which is built from the $string variable.

How can I convert from the MySQL cp1251_general_ci collation (Windows-1251) to UTF-8?

P.S.

echo $string; // echoes ������
echo bin2hex($string); // echoes cce5e3e0f4eeed
$string = mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
echo $string; // echoes Мегафон
echo bin2hex($string); // echoes d09cd0b5d0b3d0b0d184d0bed0bd

But

fopen("../tmp/$string.log", "w");

creates a file .../tmp/??????????????.log (on Linux)


Solution

  • Found the reason for this strange situation!

    In short: if a properly encoded UTF-8 string shows up as unreadable symbols on a server (in a terminal), check the server locale. And if mb_detect_encoding() behaves strangely, remember that it does not determine a string's encoding precisely.

    The reason for the incorrect encoding in the file name .../tmp/??????????????.log is the locale on the server. Here is the output of the locale command on the server where the file is located:

    $ locale
    LANG=
    LC_CTYPE="C"
    LC_COLLATE="C"
    LC_TIME="C"
    LC_NUMERIC="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_ALL=
    

    For UTF-8 characters in file names to display correctly on the server, the server locale must be UTF-8 too.
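One way to check that the locale affects only how the name is displayed, and not what is stored, is to read the name back from PHP and compare raw bytes. This is a minimal sketch assuming a Linux file system that stores file names as plain byte strings; the scratch directory name is hypothetical and exists only for the demonstration.

```php
<?php
// Bytes from the question: "Мегафон" encoded as Windows-1251.
$cp1251 = hex2bin('cce5e3e0f4eeed');
$utf8 = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

// Hypothetical scratch directory, used only for this demonstration.
$dir = sys_get_temp_dir() . '/enc_demo_' . uniqid();
mkdir($dir);
touch("$dir/$utf8.log");

// Read the directory entry back and compare it byte for byte.
$names = array_values(array_diff(scandir($dir), ['.', '..']));
$stored = $names[0];
$sameBytes = ($stored === "$utf8.log");

echo bin2hex($stored), "\n"; // hex of the stored name, independent of the terminal locale
var_dump($sameBytes);

// Clean up the scratch files.
unlink("$dir/$stored");
rmdir($dir);
```

If the comparison holds, the file name on disk is fine and only tools such as ls, running under the C locale, render it as question marks.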

    As for the conversions in the question, both methods:

    iconv('Windows-1251', 'UTF-8', $string);
    

    and

    mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
    

    work fine in this case.
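The claim can be checked against the hex dumps given in the question's P.S. The sketch below rebuilds the Windows-1251 bytes with hex2bin (standing in for the value that would come from the database) and runs both conversions on them.

```php
<?php
// "Мегафон" in Windows-1251, taken from the bin2hex output in the question.
$cp1251 = hex2bin('cce5e3e0f4eeed');

$viaIconv = iconv('Windows-1251', 'UTF-8', $cp1251);
$viaMb = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

echo bin2hex($viaIconv), "\n"; // d09cd0b5d0b3d0b0d184d0bed0bd
echo bin2hex($viaMb), "\n";    // d09cd0b5d0b3d0b0d184d0bed0bd
var_dump($viaIconv === $viaMb); // bool(true)
```

Comparing bin2hex output avoids relying on the terminal, which may itself be misconfigured.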

    The only remaining question is why the second echo in

    echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // echoes Windows-1251
    $string = mb_convert_encoding($string, 'UTF-8', 'Windows-1251');
    echo mb_detect_encoding($string,'CP1251,UTF-8,Windows-1251'); // again echoes Windows-1251
    

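    does not report UTF-8.

    The answer is that mb_detect_encoding does not determine a string's encoding precisely. It only picks a candidate from the list, and the likely explanation here is that the converted UTF-8 bytes are also acceptable as Windows-1251, which appears first in the candidate list (as CP1251).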
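A more dependable check than guessing the encoding is to validate the bytes against the encoding you expect. The sketch below uses mb_check_encoding on the two hex dumps from the question; it is an illustration of validation, not a replacement for knowing the source encoding. The last call is included without a stated result so you can see on your own PHP version whether the UTF-8 bytes also pass as Windows-1251.

```php
<?php
// The two byte strings from the question's P.S.
$cp1251 = hex2bin('cce5e3e0f4eeed');               // before conversion
$utf8 = hex2bin('d09cd0b5d0b3d0b0d184d0bed0bd');   // after conversion

// Validate against the expected target encoding.
$before = mb_check_encoding($cp1251, 'UTF-8');
$after = mb_check_encoding($utf8, 'UTF-8');

var_dump($before); // bool(false): 0xCC is not followed by a continuation byte
var_dump($after);  // bool(true)

// Shows whether the same UTF-8 bytes are also accepted as Windows-1251.
var_dump(mb_check_encoding($utf8, 'Windows-1251'));
```

Validation answers a yes/no question about one encoding, which is why it stays reliable where detection across several single-byte candidates does not.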