crystal-lang grapheme grapheme-cluster

How to iterate over grapheme clusters in Crystal?


The Unicode standard defines a grapheme cluster as an algorithmic approximation of a "user-perceived character": roughly what people think of as a single "character" in text. It is therefore a natural and important requirement in programming to be able to operate on strings as sequences of grapheme clusters.

The best general-purpose definition is the extended grapheme cluster; other definitions (tailored grapheme clusters) are meant for specific localized usages.

In Crystal, how can I iterate over (or otherwise operate on) a String as a sequence of grapheme clusters?


Solution

  • This answer is based on a thread in the Crystal forum.

    Unfortunately, as of 1.0.0 Crystal has no dedicated built-in API for this.

    However, Crystal's regex engine supports the \X pattern, which matches a single extended grapheme cluster:

    "\u0067\u0308\u1100\u1161\u11A8".scan(/\X/) do |match|
      grapheme = match[0]
      puts grapheme
    end
    
    # Output:
    # g̈
    # 각
    

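
    To see why this matters, here is a minimal sketch comparing the code-point count with the grapheme cluster count for the same string (the variable names are illustrative only):

    ```crystal
    s = "\u0067\u0308\u1100\u1161\u11A8"

    # String#size counts Unicode code points, not user-perceived characters.
    codepoint_count = s.size

    # Each \X match is one extended grapheme cluster, so the number of
    # matches is the number of grapheme clusters.
    grapheme_count = s.scan(/\X/).size

    puts codepoint_count # => 5
    puts grapheme_count  # => 2
    ```

    The five code points collapse into the two grapheme clusters printed by the example above.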

    You can wrap this up in a nicer API as follows:

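    # Yields each extended grapheme cluster of `s` as a String.
    def each_grapheme(s : String, &)
      s.scan(/\X/) do |match|
        yield match[0]
      end
    end
    
    # Returns the extended grapheme clusters of `s` as an Array of Strings.
    def graphemes(s : String) : Array(String)
      result = Array(String).new
      each_grapheme(s) do |g|
        result << g
      end
      result
    end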
    
    # Example from https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html
    s = "\u{E9}\u{65}\u{301}\u{D55C}\u{1112}\u{1161}\u{11AB}"
    each_grapheme(s) do |g|
      puts "#{g}\t#{g.codepoints}"
    end
    
    # Output:
    # é [233]
    # é    [101, 769]
    # 한 [54620]
    # 한   [4370, 4449, 4523]
    

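
    The same approach extends to other string operations. As a sketch, assuming \X behaves as in the examples above, a hypothetical helper `first_graphemes` (not part of the standard library) truncates a string without splitting a grapheme cluster, whereas slicing by code point with `String#[]` can cut one apart:

    ```crystal
    # Hypothetical helper: returns the first `n` extended grapheme clusters of `s`.
    def first_graphemes(s : String, n : Int32) : String
      s.scan(/\X/).first(n).map { |match| match[0] }.join
    end

    # "e" + combining acute accent, followed by a precomposed Hangul syllable.
    s = "\u{65}\u{301}\u{D55C}"

    by_codepoint = s[0, 1]              # slices by code point
    by_grapheme = first_graphemes(s, 1) # slices by grapheme cluster

    p by_codepoint.codepoints # => [101]
    p by_grapheme.codepoints  # => [101, 769]
    ```

    Slicing by code point drops the combining accent, while the grapheme-based helper keeps the base letter and its accent together.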