I'm building an API for a mobile application and I seem to have a problem with counting the length of a string containing emojis. My code:
$str = "👍🏿✌🏿️ @mention";
printf("strlen: %d" . PHP_EOL, strlen($str));
printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8"));
printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16"));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str)));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1", "UTF-16", $str)));
the response of this is:
strlen: 27
mb_strlen UTF-8: 14
mb_strlen UTF-16: 13
iconv UTF-16: 14
iconv UTF-16: 27
however i should get 17 as the result. We tried to cound the string length on iOS, android and windows phone, it's 17 everywhere. iOS (swift) snippet:
var str = "👍🏿✌🏿️ @mention"
(str as NSString).length // 17
count(str) // 13
count(str.utf16) // 17
count(str.utf8) // 27
We need to use the NSString because of a library. I need this to get the starting and ending position of the "@mention". If the string contains only text or only emojis, it works fine so probably there is some issue with mixed content.
What am i doing wrong? What other info can I provide you guys to get me in the right direction?
Thanks!
Your functions are all counting different things.
Graphemes: 👍 🏿 ✌ 🏿️ @ m e n t i o n 13
----------- ----------- -------- --------------------- ------ ------ ------ ------ ------ ------ ------ ------ ------
Code points: U+1F44D U+1F3FF U+270C U+1F3FF U+FE0F U+0020 U+0040 U+006D U+0065 U+006E U+0074 U+0069 U+006F U+006E 14
UTF-16 code units: D83D DC4D D83C DFFF 270C D83C DFFF FE0F 0020 0040 006D 0065 006E 0074 0069 006F 006E 17
UTF-16-encoded bytes: 3D D8 4D DC 3C D8 FF DF 0C 27 3C D8 FF DF 0F FE 20 00 40 00 6D 00 65 00 6E 00 74 00 69 00 6F 00 6E 00 34
UTF-8-encoded bytes: F0 9F 91 8D F0 9F 8F BF E2 9C 8C F0 9F 8F BF EF B8 8F 20 40 6D 65 6E 74 69 6F 6E 27
PHP strings are natively bytes.
strlen()
counts the number of bytes in a string: 27.
mb_strlen(..., 'utf-8')
counts the number of code points (Unicode characters) in a string when its bytes are decoded to characters using the UTF-8 encoding: 14.
(The other example counts are largely meaningless as they're based on treating the input string as one encoding when actually it contains data in a different encoding.)
NSStrings are natively counted as UTF-16 code units. There are 17, not 14, because the above string contains characters like 👍
that don't fit in a single UTF-16 code unit, so have to be encoded as a surrogate pair. There aren't any functions that will count strings in UTF-16 code units in PHP, but because each code unit is encoded to two bytes, you can work it out easily enough by encoding to UTF-16 and dividing the number of bytes by two:
strlen(iconv('utf-8', 'utf-16le', $str)) / 2
(Note: the le
suffix is necessary to make iconv
encode to a particular endianness of UTF-16, and not foul up the count by choosing one and adding a BOM to the start of the string to say which one it chose.)