The Unicode standard defines a grapheme cluster as an algorithmic approximation to a "user-perceived character". A grapheme cluster more or less corresponds to what people think of as a single "character" in text. Therefore it is a natural and important requirement in programming to be able to operate on strings as sequences of grapheme clusters.
The best general-purpose definition is the extended grapheme cluster; the standard also allows tailored grapheme clusters, which adjust the algorithm for specific localized usages.
In Crystal, how can I iterate over (or otherwise operate on) a String as a sequence of grapheme clusters?
This answer is based on a thread in the Crystal forum.
As of Crystal 1.0.0, the standard library unfortunately has no dedicated API for this.
However, Crystal's regex engine supports the \X pattern, which matches a single extended grapheme cluster, so String#scan can do the job:
"\u0067\u0308\u1100\u1161\u11A8".scan(/\X/) do |match|
  grapheme = match[0]
  puts grapheme
end
# Output:
# g̈
# 각
You can wrap this up in a nicer API as follows:
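# Yields each extended grapheme cluster of `s` as a String.
def each_grapheme(s : String, &)
  s.scan(/\X/) do |match|
    yield match[0]
  end
end

# Returns the extended grapheme clusters of `s` as an array of Strings.
def graphemes(s : String) : Array(String)
  result = [] of String
  each_grapheme(s) do |g|
    result << g
  end
  result
end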
# Example from https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html
s = "\u{E9}\u{65}\u{301}\u{D55C}\u{1112}\u{1161}\u{11AB}"
each_grapheme(s) do |g|
  puts "#{g}\t#{g.codepoints}"
end
# Output:
# é [233]
# é [101, 769]
# 한 [54620]
# 한 [4370, 4449, 4523]
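The same \X pattern also covers the "otherwise operate on" part of the question. As a minimal sketch, the snippet below counts grapheme clusters and compares the result with String#size, which counts code points. The name grapheme_count is a hypothetical helper introduced here for illustration; it is not part of Crystal's standard library.

```crystal
# Hypothetical helper (not a standard library method): counts extended
# grapheme clusters by collecting every \X match and taking the array size.
def grapheme_count(s : String) : Int32
  s.scan(/\X/).size
end

# Same string as in the first example: "g" + combining diaeresis,
# followed by three Hangul jamo that form one syllable.
s = "\u0067\u0308\u1100\u1161\u11A8"

puts s.size            # => 5 (code points)
puts grapheme_count(s) # => 2 (the two clusters printed in the first example)
```

Collecting all matches into an array is simple but allocates one match object per cluster; for long strings, the block form of scan with a running counter would avoid that.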