I have some texts with mixed encodings. For example, the following text mixes UTF-8 and ISO-8859-1:
Ù…Øتوای میکس شده و بخش سالم
I want all of it to become UTF-8: the parts that are already UTF-8 should be left as they are, and the other parts should be converted to UTF-8. For example, the text above should be output as:
محتوای میکس شده و بخش سالم
I tried several approaches, including PHP's iconv function and the following class:
https://github.com/neitanod/forceutf8
But none of them gave me the correct output; part of the text always turns into question marks, like ???????.
What is the best way to convert mixed encoding to UTF-8 without any damage?
Edit:
Raw bytes of the mixed text:
c399e280a6c398c2adc398c2aac399cb86c398c2a7c39bc59220c399e280a6c39bc592c39ac2a9c398c2b320c398c2b4c398c2afc399e280a120d98820d8a8d8aed8b420d8b3d8a7d984d985
Correct text:
محتوای میکس شده و بخش سالم
Part of your string is Windows-1252 mojibake: at some point a UTF-8 string was misinterpreted as Windows-1252 and, based on that wrong assumption, converted to UTF-8. That can be reversed by transcoding the string from UTF-8 back to Windows-1252, which yields the original UTF-8 byte sequence. To apply that only to the part of the text that is messed up, you can use a regex to, for instance, restrict the transformation to the non-Arabic parts of the text:
// sample data
$str_hex = 'c399e280a6c398c2adc398c2aac399cb86c398c2a7c39bc59220c399e280a6c39bc592c39ac2a9c398c2b320c398c2b4c398c2afc399e280a120d98820d8a8d8aed8b420d8b3d8a7d984d985';
// actual string
$str = hex2bin($str_hex);
echo 'Messed up: ', $str, PHP_EOL; // Ù…Øتوای میکس شده و بخش سالم
$fixed = preg_replace_callback(
    '/\\P{Arabic}+/u', // matches non-Arabic sequences
    function (array $m) { return iconv('UTF-8', 'Windows-1252', $m[0]); },
    $str
);
echo 'Fixed: ', $fixed; // محتوای میکس شده و بخش سالم
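The snippet above assumes every non-Arabic run is mojibake. If the text may also contain legitimate non-Arabic characters, a more defensive variant could look like the sketch below. The helper name `fix_mojibake_runs` is hypothetical (it is not part of PHP or of the forceutf8 library), and the validity check is only a heuristic: a run is replaced only when the reverse conversion succeeds and the resulting bytes are valid UTF-8, otherwise it is left as it was.

```php
<?php
// Hypothetical helper, not part of any library: undo Windows-1252 mojibake
// in the non-Arabic runs of $text, keeping a run unchanged when the reverse
// conversion fails or does not produce valid UTF-8.
function fix_mojibake_runs(string $text): string
{
    $result = preg_replace_callback(
        '/\P{Arabic}+/u', // non-Arabic sequences, as in the snippet above
        function (array $m): string {
            // @ silences the notice iconv raises for unconvertible characters
            $bytes = @iconv('UTF-8', 'Windows-1252', $m[0]);
            // preg_match with the /u flag only returns 1 for valid UTF-8 input
            if ($bytes === false || preg_match('//u', $bytes) !== 1) {
                return $m[0]; // not reversible mojibake: keep the run as is
            }
            return $bytes;
        },
        $text
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8 input)
    return $result ?? $text;
}

// Usage with the sample bytes from the question
$sample = hex2bin('c399e280a6c398c2adc398c2aac399cb86c398c2a7c39bc59220c399e280a6c39bc592c39ac2a9c398c2b320c398c2b4c398c2afc399e280a120d98820d8a8d8aed8b420d8b3d8a7d984d985');
echo 'Fixed: ', fix_mojibake_runs($sample), PHP_EOL;
```

With this check, a run such as "café" that is already correct stays untouched, because its Windows-1252 bytes are not valid UTF-8 on their own. Plain ASCII runs pass through unchanged either way.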