"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]
Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters
method, as well as each_grapheme_cluster
if you just want to iterate/enumerate without necessarily creating an array.
In Ruby 2.0 or above you can use str.scan /\X/
> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]
# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6
If you want to match the grapheme boundaries for any reason, you can use (?=\X)
in your regex, for instance:
> "d̪".split /(?=\X)/
=> ["d̪"]
ActiveSupport (which is included in Rails) also has a way if you can't use \X
for some reason:
ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }