unicode, internationalization, language-lawyer, software-design

Why is it that if you search for a or A on this very page, you will not find а or А?

And if you search for p or P, you will not find р or Р either.

Why does Unicode use different code points for, say, a/A (Latin) and а/А (Cyrillic)? What is the relevance of this from a software-design standpoint?

Let me explain where my curiosity lies:

  • a/A are the lowercase and uppercase forms of the same letter, which is used in
    • Italian, where it is pronounced /a/,
    • English, where it is pronounced /a/, /æ/, or /e/ (and probably other shades) depending on the word it appears in,
    • French, where ...;
  • a graphically identical letter, а/А, is used in Cyrillic, but it is assigned to different code points.

a/A and а/А have the same shape and not entirely different pronunciations, so why do they not share the same code points (one for lowercase and one for uppercase)?

The only reason I can think of, which occurred to me only now as I write the question, is that they belong to different alphabets, and the characters of a given alphabet are best laid out sequentially, e.g. (in C++) assert(u'a' + 1 == u'b') and likewise assert(u'а' + 1 == u'б').
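
The two assertions above can be checked at compile time. The following is a minimal sketch, not a definitive test: the constant names and the `follows` helper are made up for illustration, and the Cyrillic letters are written as universal character names so the result does not depend on the source file's encoding.

```cpp
#include <cassert>

// Hypothetical names, chosen only for this illustration.
constexpr char16_t latin_a     = u'a';      // U+0061 LATIN SMALL LETTER A
constexpr char16_t latin_b     = u'b';      // U+0062 LATIN SMALL LETTER B
constexpr char16_t cyrillic_a  = u'\u0430'; // U+0430 CYRILLIC SMALL LETTER A
constexpr char16_t cyrillic_be = u'\u0431'; // U+0431 CYRILLIC SMALL LETTER BE

// True when `next` is the code point directly after `c`.
constexpr bool follows(char16_t c, char16_t next) {
    return c + 1 == next;
}

// Each alphabet is laid out sequentially within its own block...
static_assert(follows(latin_a, latin_b), "Latin b follows Latin a");
static_assert(follows(cyrillic_a, cyrillic_be), "Cyrillic be follows Cyrillic a");

// ...and the two look-alike letters are distinct code points.
static_assert(latin_a != cyrillic_a, "look-alikes are not the same character");
```

If a and а shared one code point, both properties could not hold at once, since the slot after U+0061 is already taken by Latin b.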

Is that the only real reason? Having each alphabet occupy sequential code points?


Solution

  • This is all explained in Unicode Technical Note #26. In short:

    Latin and Cyrillic are different scripts, even if some of their letters look very similar to one another. Visual appearance is not the only factor that defines a character’s identity, and letters are generally never unified across scripts. It would just make the Unicode Standard harder to use for everybody.
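
One way to see that appearance alone does not define a character is case mapping. The sketch below is only an illustration under stated assumptions: `to_lower_lookalike` is a hypothetical helper covering exactly three look-alike capitals (Latin B, Greek Β, Cyrillic В), not a general lowercase function and not an API from any library.

```cpp
#include <cassert>

// Three capitals that look alike but belong to different scripts.
constexpr char32_t latin_capital_b     = U'B';      // U+0042 LATIN CAPITAL LETTER B
constexpr char32_t greek_capital_beta  = U'\u0392'; // U+0392 GREEK CAPITAL LETTER BETA
constexpr char32_t cyrillic_capital_ve = U'\u0412'; // U+0412 CYRILLIC CAPITAL LETTER VE

// Hypothetical helper: lowercases only the three letters above.
// For each of them the lowercase form sits 0x20 above the capital.
constexpr char32_t to_lower_lookalike(char32_t c) {
    return (c == latin_capital_b || c == greek_capital_beta ||
            c == cyrillic_capital_ve)
               ? c + 0x20
               : c;
}

// The same-looking capitals lowercase to three different letters.
static_assert(to_lower_lookalike(latin_capital_b) == U'b', "B -> b");
static_assert(to_lower_lookalike(greek_capital_beta) == U'\u03B2', "Beta -> beta");
static_assert(to_lower_lookalike(cyrillic_capital_ve) == U'\u0432', "Ve -> ve");
```

With a single unified code point for the three capitals, a lowercase conversion would have no way to tell which of b, β, or в was meant without extra information about the script of the surrounding text.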