c++ utf icu breakiterator grapheme-cluster

Maximum number of codepoints in a grapheme cluster


I am using the C++ ICU library. I wish to split a UTF-8 string into approximately equal chunks, with the chunk boundaries falling on grapheme cluster boundaries. For both memory and speed reasons, I do not want to convert the entire string to UTF-16. Instead, I want to convert only a small number of code points close to each estimated chunk boundary to UTF-16, and then use ICU's BreakIterator to work out the exact boundaries.

Is there a hard upper limit on the number of code points that can make up a grapheme cluster? If so, what is it? I need to know this in order to determine the minimum number of code points that I need to convert from UTF-8 to UTF-16.


Solution

  • Is there a hard upper limit on the number of code points that can make up a grapheme cluster?

    No. There is no hard upper limit on the number of code points that a grapheme cluster (i.e. a user-perceived character) can consist of.

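    For example, you could keep appending ZERO WIDTH JOINER followed by a joined character (such as another emoji in a ZWJ sequence), or keep stacking combining marks on a base character, and the result would still be a single grapheme cluster.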
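To make the "no upper limit" point concrete, here is a minimal sketch in plain C++ (no ICU) that builds a string of one base letter followed by n combining acute accents (U+0301) and counts its code points. The helper names (decode_utf8, is_extend_or_zwj, count_clusters_simplified, stacked_marks) are hypothetical and exist only for this illustration. The boundary check is a deliberately tiny stand-in for the real rules: it assumes well-formed UTF-8 and only treats U+0300..U+036F and U+200D as "never break before", which is not a full UAX #29 implementation. For real text, ICU's BreakIterator should decide the boundaries.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into code points.
// Assumption: the input is well-formed UTF-8 (no validation is done here).
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += static_cast<std::size_t>(len);
    }
    return out;
}

// Hypothetical, heavily simplified stand-in for "Extend or ZWJ":
// only the Combining Diacritical Marks block and ZERO WIDTH JOINER.
bool is_extend_or_zwj(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) || cp == 0x200D;
}

// Count clusters under the simplified rule "do not break before an
// Extend or ZWJ code point; break everywhere else".
std::size_t count_clusters_simplified(const std::vector<char32_t>& cps) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < cps.size(); ++i)
        if (i == 0 || !is_extend_or_zwj(cps[i])) ++clusters;
    return clusters;
}

// Build "e" followed by n copies of U+0301 COMBINING ACUTE ACCENT.
std::string stacked_marks(std::size_t n) {
    std::string s = "e";
    for (std::size_t i = 0; i < n; ++i) s += "\xCC\x81";  // U+0301 in UTF-8
    return s;
}
```

Calling stacked_marks(n) with a larger n yields a string with n + 1 code points that the simplified rule still counts as one cluster, so any fixed look-around window can be defeated by a long enough sequence. In practice this suggests widening the window until BreakIterator reports a boundary strictly inside it, rather than relying on a constant.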