unicode, normalization, unicode-normalization, case-folding

Maximum length of a string after performing Unicode casefolding


I need to perform casefolding on a set of strings, and must ensure beforehand that they will not exceed a given length afterwards (so that I can hard-code the needed buffer size). The problem is that a string's length (in code points) may change when casefolding is applied. See, e.g., in Python 3:

>>> "süß".casefold()
'süss'

Now, the maximum number of code points that a single code point can expand to under casefolding can be computed easily:

>>> max(len(chr(s).casefold()) for s in range(0x10FFFF + 1))
3

But does this bound hold for whole strings? That is, could the sequence of code points (the order in which they appear) affect the final length of the string, due to some arcane property of Unicode? Or can I assume that the final string will always be at most 3 times as long as the original?


Solution

  • The Unicode Standard defines casefolding as follows:

    toCasefold(X): Map each character C in X to Case_Folding(C).

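    So every character in a string is casefolded regardless of context, and the results are concatenated. Since no single code point folds to more than three code points, your assumption is correct: a casefolded string is guaranteed to have at most three times the number of code points of the original. Note that this bound is in code points, not in bytes of any particular encoding.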
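
    As a sanity check, the per-character definition can be compared against Python's built-in str.casefold(). The following is only a sketch: the helper names (casefold_per_char, max_expansion, check_random_strings) are made up for illustration, and it assumes that str.casefold() implements toCasefold as quoted above. A randomized test can support that assumption but not prove it.

```python
import random


def casefold_per_char(s: str) -> str:
    """Fold each code point on its own and concatenate the results."""
    return "".join(ch.casefold() for ch in s)


def max_expansion() -> int:
    """Largest number of code points that any single code point folds to."""
    return max(len(chr(cp).casefold()) for cp in range(0x110000))


def check_random_strings(trials: int = 500, max_len: int = 20, seed: int = 0) -> bool:
    """Compare whole-string folding with per-character folding on random input.

    Also checks the length bound len(folded) <= max_expansion() * len(s).
    """
    rng = random.Random(seed)
    bound = max_expansion()
    # Favour code points that casefolding actually changes, so the test
    # is not dominated by characters that fold to themselves.
    changing = [chr(cp) for cp in range(0x110000) if chr(cp).casefold() != chr(cp)]
    for _ in range(trials):
        length = rng.randint(0, max_len)
        s = "".join(
            rng.choice(changing) if rng.random() < 0.7 else chr(rng.randrange(0x110000))
            for _ in range(length)
        )
        folded = s.casefold()
        if folded != casefold_per_char(s):
            return False
        if len(folded) > bound * len(s):
            return False
    return True
```

    If check_random_strings() returns True, no tested string folded differently as a whole than character by character, which is what the definition predicts. For a hard-coded buffer, remember to convert the code point bound into the units of your encoding.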