Search code examples
phpunicodeunicode-string

PHP - length of string containing emojis/special chars


I'm building an API for a mobile application and I seem to have a problem with counting the length of a string containing emojis. My code:

$str = "👍🏿✌🏿️ @mention";

printf("strlen: %d" . PHP_EOL, strlen($str));
printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8"));
printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16"));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str)));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1", "UTF-16", $str)));

the response of this is:

strlen: 27
mb_strlen UTF-8: 14
mb_strlen UTF-16: 13
iconv UTF-16: 14
iconv UTF-16: 27

however i should get 17 as the result. We tried to cound the string length on iOS, android and windows phone, it's 17 everywhere. iOS (swift) snippet:

var str = "👍🏿✌🏿️ @mention"
(str as NSString).length // 17
count(str) // 13
count(str.utf16) // 17
count(str.utf8) // 27

We need to use the NSString because of a library. I need this to get the starting and ending position of the "@mention". If the string contains only text or only emojis, it works fine so probably there is some issue with mixed content.

What am i doing wrong? What other info can I provide you guys to get me in the right direction?

Thanks!


Solution

  • Your functions are all counting different things.

    Graphemes:                 👍            🏿          ✌                🏿️                  @      m      e      n      t      i      o      n     13
                          -----------  -----------  --------  ---------------------  ------ ------ ------ ------ ------ ------ ------ ------ ------
    Code points:            U+1F44D      U+1F3FF     U+270C     U+1F3FF     U+FE0F   U+0020 U+0040 U+006D U+0065 U+006E U+0074 U+0069 U+006F U+006E  14
    UTF-16 code units:     D83D DC4D    D83C DFFF     270C     D83C DFFF     FE0F     0020   0040   006D   0065   006E   0074   0069   006F   006E   17
    UTF-16-encoded bytes: 3D D8 4D DC  3C D8 FF DF   0C 27    3C D8 FF DF   0F FE    20 00  40 00  6D 00  65 00  6E 00  74 00  69 00  6F 00  6E 00   34
    UTF-8-encoded bytes:  F0 9F 91 8D  F0 9F 8F BF  E2 9C 8C  F0 9F 8F BF  EF B8 8F    20     40     6D     65     6E     74     69     6F     6E    27
    

    PHP strings are natively bytes.

    strlen() counts the number of bytes in a string: 27.

    mb_strlen(..., 'utf-8') counts the number of code points (Unicode characters) in a string when its bytes are decoded to characters using the UTF-8 encoding: 14.

    (The other example counts are largely meaningless as they're based on treating the input string as one encoding when actually it contains data in a different encoding.)

    NSStrings are natively counted as UTF-16 code units. There are 17, not 14, because the above string contains characters like 👍 that don't fit in a single UTF-16 code unit, so have to be encoded as a surrogate pair. There aren't any functions that will count strings in UTF-16 code units in PHP, but because each code unit is encoded to two bytes, you can work it out easily enough by encoding to UTF-16 and dividing the number of bytes by two:

    strlen(iconv('utf-8', 'utf-16le', $str)) / 2
    

    (Note: the le suffix is necessary to make iconv encode to a particular endianness of UTF-16, and not foul up the count by choosing one and adding a BOM to the start of the string to say which one it chose.)