Tags: php, unicode, utf-8, iso-8859-1

Converting mixed-encoding text so that everything is UTF-8


I have some text with mixed encodings. For example, the following text mixes UTF-8 and ISO-8859-1:

Ù…Ø­ØªÙˆØ§ÛŒ Ù…ÛŒÚ©Ø³ Ø´Ø¯Ù‡ و بخش سالم

I want all of it to be UTF-8: the parts that are already valid UTF-8 should stay as they are, and the rest should be converted to UTF-8. For example, the text above should come out as:

محتوای میکس شده و بخش سالم

I tried several approaches, including PHP's iconv function and the following class:

https://github.com/neitanod/forceutf8

None of them gave me the correct output; part of the text always turns into question marks, like ???????.

What is the best way to convert mixed encoding to UTF-8 without any damage?

Edit:

Raw bytes of the mixed text:

c399e280a6c398c2adc398c2aac399cb86c398c2a7c39bc59220c399e280a6c39bc592c39ac2a9c398c2b320c398c2b4c398c2afc399e280a120d98820d8a8d8aed8b420d8b3d8a7d984d985

Correct text:

محتوای میکس شده و بخش سالم

Solution

  • Part of your string is Windows-1252 mojibake: at some point the UTF-8 bytes were misread as Windows-1252 and then converted to UTF-8 on that wrong assumption. You can reverse this by transcoding the damaged text from UTF-8 back to Windows-1252; the bytes that come out are the original, correct UTF-8 sequence. To apply that only to the damaged part of the text, you can use a regex that, for instance, restricts the transformation to the non-Arabic runs:

    // sample data
    $str_hex = 'c399e280a6c398c2adc398c2aac399cb86c398c2a7c39bc59220c399e280a6c39bc592c39ac2a9c398c2b320c398c2b4c398c2afc399e280a120d98820d8a8d8aed8b420d8b3d8a7d984d985';
    // actual string
    $str = hex2bin($str_hex);
    
    echo 'Messed up: ', $str, PHP_EOL;  // first three words print as mojibake, the rest is readable
    
    $fixed = preg_replace_callback(
        '/\\P{Arabic}+/u',  // matches non-Arabic sequences
        function (array $m) { return iconv('UTF-8', 'Windows-1252', $m[0]); }, 
        $str
    );
    
    echo 'Fixed: ', $fixed;  // محتوای میکس شده و بخش سالم
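
To see why the reverse transcoding works without hard-coding the Arabic script, here is a minimal sketch that is not part of the snippet above. The helper names `undo_mojibake` and `fix_mixed` are hypothetical. It assumes the damage is exactly one Windows-1252 round trip and that damaged and clean words are separated by spaces. A run is only replaced when it converts to Windows-1252 losslessly and the resulting bytes are valid UTF-8; otherwise it is left as it is.

```php
<?php
// Hypothetical helper names; iconv(), preg_match(), preg_replace_callback(),
// hex2bin() and bin2hex() are PHP built-ins.

/**
 * Try to undo one layer of Windows-1252 mojibake in $s.
 * Returns $s unchanged if the conversion fails, is lossy,
 * or does not produce valid UTF-8.
 */
function undo_mojibake(string $s): string
{
    // @ hides the notice iconv raises for characters outside Windows-1252.
    $bytes = @iconv('UTF-8', 'Windows-1252', $s);
    if ($bytes === false) {
        return $s;                      // not representable in Windows-1252
    }
    if (preg_match('//u', $bytes) !== 1) {
        return $s;                      // result is not valid UTF-8
    }
    if (@iconv('Windows-1252', 'UTF-8', $bytes) !== $s) {
        return $s;                      // conversion substituted characters
    }
    return $bytes;
}

/** Apply undo_mojibake() to every run of non-space bytes (an assumption about the input). */
function fix_mixed(string $text): string
{
    return preg_replace_callback(
        '/[^ ]+/',
        function (array $m) { return undo_mojibake($m[0]); },
        $text
    );
}

// One word as correct UTF-8, and its damaged form taken from the question's bytes.
$good = hex2bin('d8b4d8afd987');
$bad  = hex2bin('c398c2b4c398c2afc399e280a1');

// How the damage arises: UTF-8 bytes read as Windows-1252, then re-encoded.
var_dump(iconv('Windows-1252', 'UTF-8', $good) === $bad);  // bool(true)
// The reverse step restores the original bytes.
var_dump(undo_mojibake($bad) === $good);                   // bool(true)
// Clean input passes through untouched.
var_dump(undo_mojibake($good) === $good);                  // bool(true)
```

The validity check is what keeps legitimate Latin text such as "café" untouched: it converts to Windows-1252, but a lone E9 byte is not valid UTF-8, so the original is returned. This sketch leaves a run unfixed if damaged text touches clean non-Latin characters with no space in between; in that case splitting by script, as in the regex above, is the safer choice.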