I have some raw text that is usually a valid UTF-8 string. However, every now and then the input turns out to be a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but since it happens rarely, I would rather not spend lots of CPU time doing it.
Is there any fast method to detect whether a string is encoded as CESU-8 or UTF-8? I guess I could always blindly convert the "UTF-8" to UTF-16LE and then back to UTF-8 using iconv(), and I would probably get the correct result every time, because CESU-8 is close enough to UTF-8 for this to work. Can you suggest anything faster? (I'm expecting the input to be CESU-8 instead of valid UTF-8 in around 0.01–0.1% of all strings.)
(CESU-8 is a non-standard encoding in which each half of a UTF-16 surrogate pair is encoded as its own three-byte UTF-8-style sequence. Technically, UTF-8 strings should contain the characters represented by those surrogate pairs, not the surrogate pairs themselves.)
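To illustrate the difference with one concrete character (U+1F60A, picked arbitrarily; its UTF-16 surrogate pair is D83D DE0A):

```php
<?php
// U+1F60A as proper UTF-8: a single four-byte sequence.
$utf8  = "\xF0\x9F\x98\x8A";
// The same character as CESU-8: each UTF-16 surrogate (D83D, DE0A)
// encoded separately as a three-byte sequence.
$cesu8 = "\xED\xA0\xBD\xED\xB8\x8A";

var_dump($utf8 === $cesu8); // bool(false): same character, different bytes
```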
Here's a more efficient version of your conversion function:
$regex = '@(\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF])@';
$s = preg_replace_callback($regex, function ($m) {
    $in = unpack("C*", $m[0]); // $in[1]..$in[6] are the six CESU-8 bytes.
    $in[2] += 1; // Effectively adds 0x10000 to the codepoint.
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
        $in[6]
    );
}, $s);
The code only converts high surrogates that are followed by low surrogates, and turns each such pair of three-byte CESU-8 sequences directly into a single four-byte UTF-8 sequence, i.e. from
ED A0-AF 80-BF ED B0-BF 80-BF
11101101 1010aaaa 10bbbbbb 11101101 1011cccc 10dddddd
to
F0-F4 80-BF 80-BF 80-BF
11110oaa 10aabbbb 10bbcccc 10dddddd // o is the "overflow" (carry) bit from the +1
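As a quick sanity check (repeating the conversion from above, with U+1F60A as an example input whose CESU-8 form is ED A0 BD ED B8 8A):

```php
<?php
$s = "before \xED\xA0\xBD\xED\xB8\x8A after"; // contains CESU-8-encoded U+1F60A

$regex = '@(\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF])@';
$s = preg_replace_callback($regex, function ($m) {
    $in = unpack("C*", $m[0]);
    $in[2] += 1;
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
        $in[6]
    );
}, $s);

// The surrogate pair becomes the four-byte UTF-8 sequence F0 9F 98 8A.
var_dump($s === "before \xF0\x9F\x98\x8A after"); // bool(true)
```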
Here's an online example.
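For the detection half of the question: every CESU-8 surrogate sequence starts with the byte 0xED, so a cheap pre-filter is to look for that byte with strpos() before running any regex or conversion. Note that 0xED is also the lead byte of legitimate three-byte UTF-8 for U+D000–U+D7FF, so a hit only means "maybe CESU-8" and still needs the regex to confirm. A sketch (the helper name `contains_cesu8` is my own, and this is not benchmarked):

```php
<?php
// Sketch: returns true only if $s actually contains a CESU-8 surrogate pair.
function contains_cesu8($s) {
    // strpos() rejects the vast majority of strings (no 0xED byte) almost for free.
    if (strpos($s, "\xED") === false) {
        return false;
    }
    // 0xED also starts valid UTF-8 for U+D000..U+D7FF, so confirm with the
    // same surrogate-pair pattern the converter above uses.
    $regex = '@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@';
    return preg_match($regex, $s) === 1;
}
```

With the stated 0.01–0.1% CESU-8 rate, the regex (and any conversion) would then run on almost no inputs, which is the point of the pre-filter.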