I have Ruby 3.3.4 installed on macOS 14.6.1.
Suppose I have this string in the shell:
$ st="0😀2☺️4🤪6🥳8🥸"
$ echo "$st"
0😀2☺️4🤪6🥳8🥸
If I now feed that string into Ruby, I get the second emoji broken into constituent parts:
$ echo "$st" | ruby -lne 'p $_.split("")'
["0", "😀", "2", "☺", "️", "4", "🤪", "6", "🥳", "8", "🥸"]
^ ^ # should be ONE grapheme
Same if I read that string from a file:
$ cat wee_file
0😀2☺️4🤪6🥳8🥸
$ ruby -lne 'p $_.split("")' wee_file
["0", "😀", "2", "☺", "️", "4", "🤪", "6", "🥳", "8", "🥸"]
Same thing in IRB:
irb(main):001> File.open('/tmp/wee_file').gets.split("")
=> ["0", "😀", "2", "☺", "️", "4", "🤪", "6", "🥳", "8", "🥸", "\n"]
But if I replace ☺️ with another emoji (which is also multibyte) the issue goes away:
$ st2="0😀2🐱4🤪6🥳8🥸"
$ echo "$st2" | ruby -lne 'p $_.split("")'
["0", "😀", "2", "🐱", "4", "🤪", "6", "🥳", "8", "🥸"]
# also from a file and also in IRB..
Any idea why the emoji ☺️ is producing this result?
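It's because ☺️ is composed of two characters:
☺ U+263A (White Smiling Face)
◌️ U+FE0F (Variation Selector-16)
The latter is used to request an emoji presentation for the preceding character. split("") splits the string into individual characters (code points), so the two end up as separate elements.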
"☺️".codepoints.map { |c| c.to_s(16) }
#=> ["263a", "fe0f"]
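This also explains why 🐱 does not show the problem: it is a single code point (U+1F431), so splitting per character leaves it intact. A small sketch comparing the two, with the emoji written as \u escapes so the code points are explicit (the variable names are only for illustration):

```ruby
smiley = "\u263A\uFE0F"  # U+263A followed by U+FE0F, i.e. the emoji from the question
cat    = "\u{1F431}"     # U+1F431, the replacement emoji, a single code point

# [bytes, characters (code points), grapheme clusters]
smiley_counts = [smiley.bytesize, smiley.length, smiley.grapheme_clusters.length]
cat_counts    = [cat.bytesize, cat.length, cat.grapheme_clusters.length]

p smiley_counts  #=> [6, 2, 1]
p cat_counts     #=> [4, 1, 1]
```

Both emoji are multibyte, but only the first one consists of more than one character, which is what split("") operates on.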
You can get the expected result via grapheme_clusters or enumerate them via each_grapheme_cluster:
"0😀2☺️4🤪6🥳8🥸".grapheme_clusters
#=> ["0", "😀", "2", "☺️", "4", "🤪", "6", "🥳", "8", "🥸"]
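For completeness, here is a hedged sketch of the same idea applied to a line as it would come from a file (with a trailing newline). It assumes the input is UTF-8; the string is built from \u escapes to match the one in the question, and the variable names are illustrative only. The \X regexp pattern matches one extended grapheme cluster, so scan(/\X/) is an alternative spelling of the same split:

```ruby
# Same content as the string in the question, plus the newline that gets would return
line = "0\u{1F600}2\u263A\uFE0F4\u{1F92A}6\u{1F973}8\u{1F978}\n"

# Array of grapheme clusters (chomp drops the trailing newline first)
clusters = line.chomp.grapheme_clusters

# Enumerate clusters one at a time, printing the code points inside each
line.chomp.each_grapheme_cluster.with_index do |cluster, i|
  puts "#{i}: #{cluster.codepoints.map { |c| format('U+%04X', c) }.join(' ')}"
end

# \X matches a single extended grapheme cluster
scanned = line.chomp.scan(/\X/)

p clusters.length  #=> 10
```

In the shell one-liner from the question, the equivalent change would be ruby -lne 'p $_.grapheme_clusters' in place of the split("") call.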