ruby unicode normalization unicode-normalization grapheme

Split Unicode entities by graphemes

"d̪".chars.to_a

gives me

["d"," ̪"]

How do I get Ruby to split it by graphemes?

["d̪"]

Solution

Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
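As a minimal sketch of those two methods (assuming Ruby 2.5 or later, and assuming the example string is "d" followed by U+032A COMBINING BRIDGE BELOW, written here as an escape so the combining mark survives copy and paste):

```ruby
# Assumption: the question's string is "d" + U+032A COMBINING BRIDGE BELOW.
str = "d\u032A"

codepoints = str.chars               # one element per code point
clusters   = str.grapheme_clusters   # one element per grapheme cluster

# Without a block, each_grapheme_cluster returns an Enumerator, so the
# clusters can be processed lazily instead of building an array first.
sizes = (str * 3).each_grapheme_cluster.map(&:length)

p codepoints.length   # 2: the base letter and the combining mark
p clusters.length     # 1: a single grapheme cluster
p sizes               # code points per cluster, for three repetitions
```

grapheme_clusters is the convenient choice when you need the array; each_grapheme_cluster avoids the intermediate array when you only iterate.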

In Ruby 2.0 or above you can use str.scan(/\X/), where \X matches one extended grapheme cluster:

> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]

# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6

If you want to match the grapheme boundaries rather than the graphemes themselves, you can use the lookahead (?=\X) in your regex, for instance:

> "d̪".split /(?=\X)/
=> ["d̪"]

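ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason. Note that unpack_graphemes was deprecated in Rails 6.0 and removed in Rails 6.1, so this only works on older versions: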

ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
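
If the same code has to run on both older and newer Rubies, the grapheme_clusters and \X approaches above can be wrapped in one helper. This is only a sketch: the method name graphemes is hypothetical, not part of Ruby or ActiveSupport.

```ruby
# Hypothetical helper (the name is an assumption for illustration): returns
# the grapheme clusters of str using whichever mechanism this Ruby provides.
def graphemes(str)
  if str.respond_to?(:grapheme_clusters)
    str.grapheme_clusters   # Ruby 2.5 and later
  else
    str.scan(/\X/)          # Ruby 2.0 through 2.4
  end
end

# Assumption: "d\u032A" is "d" followed by U+032A COMBINING BRIDGE BELOW.
p graphemes("d\u032A" * 3).length
```

Checking respond_to? instead of comparing RUBY_VERSION keeps the helper working on any implementation that provides the method.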