ruby unicode normalization unicode-normalization grapheme

Split Unicode entities by graphemes


"d̪".chars.to_a

gives me

["d"," ̪"]

How do I get Ruby to split it by graphemes?

["d̪"]

Solution

  • Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
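A minimal sketch of the two Ruby 2.5+ methods mentioned in the edit. It assumes the combining mark in the question is U+032A (COMBINING BRIDGE BELOW) and writes it as an escape so the example does not depend on how the source file is saved; the variable names are only illustrative.

```ruby
# Requires Ruby 2.5 or later.
# Assumption: "d̪" is 'd' followed by U+032A COMBINING BRIDGE BELOW.
str = "d\u032A" * 3

# grapheme_clusters builds an array of user-perceived characters.
clusters = str.grapheme_clusters
puts clusters.length   # 3 clusters
puts str.length        # 6 codepoints

# each_grapheme_cluster iterates without building the array first;
# here each cluster is printed as its list of codepoints.
str.each_grapheme_cluster do |cluster|
  puts cluster.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
end
```

Called without a block, each_grapheme_cluster returns an enumerator, so something like str.each_grapheme_cluster.count also works when only the number of graphemes is needed.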


    In Ruby 2.0 or above you can use str.scan /\X/

    > "d̪".scan /\X/
    => ["d̪"]
    > "d̪d̪d̪".scan /\X/
    => ["d̪", "d̪", "d̪"]
    
    # Let's get crazy:
    
    
    > str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
    
    
    > str.length
    => 75
    > str.scan(/\X/).length
    => 6
    

    If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:

    > "d̪".split /(?=\X)/
    => ["d̪"]
    

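    ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason. Note that unpack_graphemes was deprecated in Rails 6.0 in favor of the built-in Ruby approaches above, so this only applies to older Rails versions: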

    ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
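For comparison, here is a rough plain-Ruby sketch of the same two steps (split into arrays of codepoints, then pack each array back into a string) without loading ActiveSupport. It is not ActiveSupport's implementation, just an equivalent shape built on \X, and it again assumes the combining mark is U+032A.

```ruby
# Assumption: "d̪" is 'd' (U+0064) followed by U+032A COMBINING BRIDGE BELOW.
str = "d\u032A"

# Step 1: one array of codepoints per grapheme, similar in shape to
# what unpack_graphemes returns.
code_groups = str.scan(/\X/).map(&:codepoints)
p code_groups   # [[100, 810]]

# Step 2: pack each codepoint array back into a UTF-8 string.
graphemes = code_groups.map { |codes| codes.pack("U*") }
puts graphemes.length   # 1
```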