
Reliably rotating any string


I was experimenting with multibyte strings and how to handle them. Using the code that you can see here

https://gist.github.com/charlydagos/89f67808e01f97e6de91

I was successful in rotating most strings. However, I noticed that the line

$chr = mb_substr($str, $i, 1);

will not work for flag emojis, since they consist of more than a single Unicode code point.

You can try the following in your own shells:

This gives the desired output: $ php string_rotate_mb.php "你好"

This, however: $ php string_rotate_mb.php "🇨🇭" returns [H][C]

Which is technically correct: it did rotate the string. But really it's a single glyph, and my desired output is the flag alone (or, for a sequence of flags, the flags in rotated order; currently that turns into even more garbled glyphs, sometimes even different flags).
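
The effect can be reproduced with a minimal sketch. It assumes the script reverses the string one code point at a time, as the [H][C] output suggests, and that the mbstring extension is available; reverse_code_points() is an illustrative name, not the code from the gist:

```php
<?php
// Sketch only: assumes the mbstring extension.
// reverse_code_points() is a hypothetical stand-in for the gist's loop.
function reverse_code_points(string $str): string
{
    $out = '';
    for ($i = mb_strlen($str, 'UTF-8') - 1; $i >= 0; $i--) {
        // mb_substr() slices by code point, so the two regional
        // indicator symbols of a flag are handled separately.
        $out .= mb_substr($str, $i, 1, 'UTF-8');
    }
    return $out;
}

$flag = "\u{1F1E8}\u{1F1ED}";                  // regional indicators C + H
var_dump(mb_strlen($flag, 'UTF-8'));           // int(2)
var_dump(reverse_code_points($flag) === "\u{1F1ED}\u{1F1E8}"); // bool(true)
```

The reversed pair is H followed by C, which is not a flag, so the terminal falls back to showing the two letters.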

How can I, then, reliably determine that I should grab a $length = 1 or a $length = 2 (or a $length = N) substring using mb_substr?

For reference, I'm using PHP 7.0.2 (cli) (built: Jan 7 2016 10:40:26) ( NTS ), ZSH_VERSION = 5.2, LC_ALL=en_us.utf-8, and iTerm2: Build 2.9.git.8dff8db518.

Update - Feb 5th 2016

Solution: https://gist.github.com/charlydagos/6755ad994da07a7b4959#file-string_rotate_working-php-L39-L56

Thank you roeland for introducing the concept of Grapheme Clusters. Good info also in the following links


Solution

  • There are a lot more examples where this fails:

    • Combining characters: compare ê and ê (the first one is actually U+0065 followed by U+0302)

    • Variants: e.g. an emoji can have a black/white or a color variant: 🎂︎ vs 🎂️. This is done by adding a variation selector after the emoji. There is a similar problem with ethnic variations: 🙌🏻 🙌🏼 🙌🏽 🙌🏾 🙌🏿. (Note: support for this is a bit spotty, but at least Windows 10 supports these variants.)

    • Flags, which consist of two code points.

    • Fractions using the fraction slash (U+2044) may be rendered as one glyph as well, e.g. 1⁄2. Note the difference from 1/2.

    And so on…

    I think what you're looking for is called grapheme clusters. Without library support I think this is pretty difficult to get right.

    For recent PHP versions there is the intl extension. You may loop over the clusters using the grapheme functions.
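
As a minimal sketch of such a loop, assuming the intl extension is loaded: reverse_graphemes() is a hypothetical helper name, and it simply swaps the mb_* calls for grapheme_strlen() and grapheme_substr(). Which sequences count as one cluster depends on the ICU version behind intl, so results for newer emoji sequences may vary.

```php
<?php
// Sketch only: assumes the intl extension is loaded.
// reverse_graphemes() is a hypothetical helper, not the code from the gist.
function reverse_graphemes(string $str): string
{
    $out = '';
    // grapheme_strlen() counts grapheme clusters rather than code points.
    for ($i = grapheme_strlen($str) - 1; $i >= 0; $i--) {
        // Take one whole cluster at cluster index $i.
        $out .= grapheme_substr($str, $i, 1);
    }
    return $out;
}

// "e" + U+0302 stays together as one cluster and moves in front of "a".
echo reverse_graphemes("ae\u{0302}"), PHP_EOL;
```

Indexing with grapheme_substr() rescans the string on every iteration, which is fine for short input; for long strings, walking the clusters once with grapheme_extract() is likely the better fit.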