Very simple question to understand, but maybe not simple to answer.
There are 0x110000 different code points in Unicode.
"Case folding" is a lossy operation one can perform on a string in order to get a representation of that string suitable for case-insensitive comparison to other strings. This is analogous, in English, to changing all your strings to lowercase before sorting them (so all the ones that start with capital letters don't end up at the front!), except the case-folding operation doesn't operate with respect to any one language's case rules (and it's therefore suitable only for internal operations, not display to users).
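To make the idea concrete, here is a minimal Python sketch. It relies on the built-in `str.casefold()`, which as far as I know applies the default full case folding; the helper name `caseless_equal` and the sample words are only illustrative.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after case folding both of them."""
    return a.casefold() == b.casefold()


# Lowercasing is not enough: the lowercase of "ß" is still "ß",
# while case folding maps it to "ss".
print("Straße".lower() == "STRASSE".lower())   # lowercase comparison
print(caseless_equal("Straße", "STRASSE"))     # case-folded comparison

# Sorting on the folded key keeps capitalised words from jumping to the front.
words = ["cherry", "Banana", "apple"]
print(sorted(words))                     # plain code point order
print(sorted(words, key=str.casefold))   # case-insensitive order
```

The folded strings are only used as comparison keys here; the original strings are what gets displayed.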
There are certain code points (I think) which will not appear in any case-folded string. I want to know, as precisely as is possible, how many of these there are.
There are several versions of the case folding algorithm (and the algorithm is customisable for different languages/contexts), but when using the algorithm as specified in the Unicode Standard, any code point that has a mapping in CaseFolding.txt (those with status C, plus either S for simple folding or F for full folding) cannot appear in a case-folded string.
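One rough way to check this without parsing CaseFolding.txt is to ask a language runtime which code points are changed by its own case folding. The sketch below assumes Python's `str.casefold()` corresponds to the C + F mappings; the count it prints depends on the Unicode version bundled with the interpreter, so treat it as an estimate rather than a definitive figure.

```python
import sys
import unicodedata


def code_points_changed_by_casefold() -> list[int]:
    """Return every code point whose one-character string differs from its case folding."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)  # 0x0 .. 0x10FFFF
        if chr(cp).casefold() != chr(cp)
    ]


changed = code_points_changed_by_casefold()
print("Unicode data version:", unicodedata.unidata_version)
print("code points changed by case folding:", len(changed))
```

Code points in `changed` are the ones that get replaced, and so are the candidates for "never appears in a case-folded string".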
For true case-insensitive comparisons, characters that are changed by NFKC normalisation or that have the Default_Ignorable_Code_Point property will also be replaced and cannot appear. This is the set of characters with an NFKC_Casefold mapping in DerivedNormalizationProps.txt: a total of 10,146 code points.
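If you want to reproduce that count for a particular Unicode version, the entries can be tallied directly from the data file. The sketch below assumes the usual UCD layout (semicolon-separated fields, `#` comments, single code points or `XXXX..YYYY` ranges) and that the property appears under the short name `NFKC_CF`; the function name and the `SAMPLE` lines are illustrative, not a copy of the real file.

```python
from typing import Iterable


def count_nfkc_cf(lines: Iterable[str]) -> int:
    """Count code points that have an NFKC_CF entry in UCD-style lines."""
    total = 0
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "NFKC_CF":
            continue  # some other property in the same file
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        total += end - start + 1
    return total


# Illustrative lines in the assumed format (not the full file).
SAMPLE = """\
# comment line
0041          ; NFKC_CF; 0061   # LATIN CAPITAL LETTER A
00AD          ; NFKC_CF;        # SOFT HYPHEN (maps to nothing)
180B..180D    ; NFKC_CF;        # a range of three code points
00C0          ; Some_Other_Property  # hypothetical non-matching entry
""".splitlines()

print(count_nfkc_cf(SAMPLE))  # → 5

# With the real file downloaded from the Unicode Character Database:
# with open("DerivedNormalizationProps.txt", encoding="utf-8") as f:
#     print(count_nfkc_cf(f))
```

The total will differ between Unicode versions, so the figure you get may not match the one quoted above.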