unicode, case-insensitive, case-folding

Why is upper casing not enough for case-insensitive comparison?


To compare two strings case insensitively, one correct way is to case fold them first. How is this better than upper casing or lower casing?

I have found examples online where lower casing doesn't work. For example, "σ" and "ς" (two lowercase forms of "Σ") don't become the same when converted to lower case. But I've failed to find why case folding is better than mapping to upper case. Is there a case where two strings that should match case insensitively don't upper case to the same string?

Another scenario is when I want to store a case insensitive index. The recommended way seems to be case folding and then normalizing. What are its advantages over storing the string mapped to upper case and normalized? The specs say mapping to upper case is not guaranteed to be stable across versions of Unicode while case folding is. But are there any cases where mapping to upper case gives a different string in an earlier version of Unicode?
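For reference, the "case fold, then normalize" key mentioned above could look roughly like the following in Python. This is only a sketch: the name `caseless_key` is made up for illustration, and the NFD / case fold / NFD ordering follows the shape of the Unicode Standard's canonical caseless matching rather than any particular library's recommended recipe.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Hypothetical index key: decompose, case fold, decompose again."""
    # Normalize first so precomposed and decomposed input fold identically.
    decomposed = unicodedata.normalize("NFD", text)
    # str.casefold() applies Unicode full case folding.
    folded = decomposed.casefold()
    # Folding can produce text that is no longer normalized, so normalize again.
    return unicodedata.normalize("NFD", folded)


# Precomposed "É" and "e" + combining acute accent get the same key.
print(caseless_key("\u00c9") == caseless_key("e\u0301"))
# Both sigma forms get the same key.
print(caseless_key("\u03c3") == caseless_key("\u03c2"))
```

Two strings are then treated as equal in the index when their keys are equal.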


Solution

  • As per Unicode stability policy, case mappings are only stable for case pairs, i.e. pairs of characters X and Y where X is the full uppercase mapping of Y, and Y is the full lowercase mapping of X. Only when both these characters exist with these properties is the casing relation between them set in stone.

    However, Unicode contains many “incomplete” case pairs where only the lowercase form has been encoded and the uppercase form is missing completely. This is usually the case for letters used in transcription systems that are traditionally lowercase-only. Should capital forms be discovered and subsequently added to Unicode, these letters would then receive a new uppercase mapping.

    The most recent characters this has happened to are “ʂ” (from Unicode 1.1), “ᶎ” (from Unicode 4.1), and “ꞔ” (from Unicode 7.0), which all got brand new uppercase forms (Ʂ, Ᶎ, and Ꞔ respectively) in Unicode 12.0, released in 2019.

    Because case mappings do not have to be unique, this makes uppercasing a poor substitute for proper case-folding. For example, both U+0434 (д) and U+1C81 (ᲁ) uppercase to U+0414 (Д), but only the former is locked into a case pair by virtue of being U+0414’s full lowercase mapping. If someone were to find a dedicated capital letter version of U+1C81 in some old manuscript, it would be given a new uppercase mapping, resulting in U+0434 and U+1C81 suddenly no longer comparing equal under that operation.

    EDIT: I have just remembered a current example of uppercasing not being sufficient for case-insensitive matching: U+1E9E (ẞ) is already a capital letter and thus uppercases to itself. Its lowercase counterpart is U+00DF (ß), but the uppercase mapping of U+00DF is the sequence <U+0053, U+0053> (SS).

    uppercase("ẞ") ≠ uppercase(lowercase("ẞ"))
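The ẞ/ß mismatch can be checked with a few lines of Python. This is a minimal sketch, assuming a Python 3 interpreter whose `str.upper()`, `str.lower()` and `str.casefold()` use the full Unicode case mappings; the helper names `same_upper` and `same_folded` are made up for this illustration.

```python
def same_upper(a: str, b: str) -> bool:
    """Compare two strings after uppercasing both."""
    return a.upper() == b.upper()


def same_folded(a: str, b: str) -> bool:
    """Compare two strings after case folding both."""
    return a.casefold() == b.casefold()


capital_sharp_s = "\u1e9e"  # ẞ
small_sharp_s = "\u00df"    # ß

# ẞ is already uppercase and maps to itself, while ß uppercases to "SS".
print(same_upper(capital_sharp_s, small_sharp_s))
# Both letters fold to "ss", so case folding treats them as equal.
print(same_folded(capital_sharp_s, small_sharp_s))
# The inequality from the answer: uppercase(ẞ) vs uppercase(lowercase(ẞ)).
print(capital_sharp_s.upper() != capital_sharp_s.lower().upper())
```

The same helpers can be applied to U+0434 and U+1C81. There the two comparisons agree today, but the uppercase result depends on the Unicode version bundled with the interpreter, which is exactly the stability concern described above.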