
Which nonnegative integers aren't assigned a character in the UCS?


Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).

Note: There is a difference between characters and abstract characters. The latter term is closer to our everyday notion of a character, while the former is a concept specific to coded character sets. Some abstract characters are represented by a sequence of more than one character. The Unicode article on Wikipedia gives an example:

For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent [an abstract character], which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301.

The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.

The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS.

Let us first consider the nonnegative integers that are not assigned a character even though they are reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, according to the basic type of the code point:

  • surrogate (the range D800–DFFF)
  • noncharacter (the range FDD0–FDEF plus any code point whose last four hexadecimal digits are FFFE or FFFF, such as FFFE, 1FFFF or 10FFFF)

    The Unicode standard defines noncharacters as follows:

    Noncharacters are code points that are permanently reserved and will never have characters assigned to them.

    This page lists noncharacters more precisely.

  • reserved (I haven't found which nonnegative integers belong to this category)
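The surrogate and noncharacter ranges listed above are fixed, so they can be tested without any data file. The following is a minimal sketch, assuming the codespace ends at 0x10FFFF; the function names are hypothetical and chosen only for illustration. It says nothing about the "reserved" category, which cannot be derived from fixed ranges.

```python
MAX_CODE_POINT = 0x10FFFF  # assumed upper bound of the codespace


def is_surrogate(cp: int) -> bool:
    """True for the surrogate range D800-DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for FDD0-FDEF and for code points whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_permanently_unassigned(cp: int) -> bool:
    """True if cp is a surrogate or a noncharacter (never assigned a character)."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"{cp!r} is outside the assumed codespace")
    return is_surrogate(cp) or is_noncharacter(cp)
```

Under these definitions, is_permanently_unassigned(0xFFFE) and is_permanently_unassigned(0xD800) are both true, while is_permanently_unassigned(0x61) is false.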

On the other hand, code points whose basic type is any of:

  • graphic
  • format
  • control
  • private use

are assigned to characters. This is, however, open to discussion. For instance, should private use code points be considered to be assigned any characters at all? The UCS itself (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:

Private use characters are not constrained in any way by this International Standard. Private use characters can be used to provide user-defined characters.
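Whichever way that question is settled, the private use code points are easy to single out, because they occupy three fixed ranges. The ranges below are not part of the passage quoted above; they are the private use areas as commonly documented for Unicode, and the helper name is hypothetical. A sketch:

```python
# Inclusive (first, last) bounds of the private use areas.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(cp: int) -> bool:
    """True if cp falls inside one of the private use areas."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)
```

A caller can then decide, as a matter of policy, whether a code point for which is_private_use returns true should count as "mapped to a character".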

Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? On some pages I have found that the whole range is, presumably, 0–0x10FFFF. Is this true?

Ideally, this information would be publicly available in a machine-readable format that one could build algorithms upon. Does such a resource exist?


For clarity: what I need is a function that takes a nonnegative integer as its argument and returns whether the UCS maps it to a character. I would prefer it to be based on official, machine-readable information. To answer this question, it would be enough to point to one such resource on which I could build the function myself.


Solution

  • The Unicode Character Database (UCD) is available on the unicode.org site, and it is machine-readable. It contains a list of all of the assigned characters. (The set of assigned code points grows with every new version of Unicode.) Full documentation on the files that make up the UCD is also linked from the UCD page.

    The range of potential code points is, as you suspect, 0–0x10FFFF. Of those, the noncharacters and the surrogate code points will never be assigned to any character. Code points in the private use areas can be given a meaning only by mutual agreement between applications; Unicode itself will never assign characters to them. Any other code point might be assigned in a future version.
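As an illustration of building such a function on the UCD, here is a rough sketch that reads lines in the format of the UCD file UnicodeData.txt: semicolon-separated fields, with the code point in hexadecimal first, the name second and the General_Category third. Large ranges appear as a pair of lines whose names end in ", First>" and ", Last>". The file also lists the surrogate (Cs) and private use (Co) ranges, so the sketch filters on the category. The function names, the private_use_counts parameter and the SAMPLE lines are assumptions made for this example; the real file is much longer, and its range end points depend on the Unicode version.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int, str]  # (first, last, General_Category), bounds inclusive

# A few lines in the UnicodeData.txt format, for demonstration only.
SAMPLE = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
    "D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;",
    "DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;",
    "E000;<Private Use, First>;Co;0;L;;;;;N;;;;;",
    "F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;",
]


def parse_unicode_data(lines: Iterable[str]) -> List[Range]:
    """Collect the code point ranges listed in UnicodeData.txt-style lines."""
    ranges: List[Range] = []
    pending_first = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            pending_first = cp  # the matching "Last" line closes the range
        elif name.endswith(", Last>") and pending_first is not None:
            ranges.append((pending_first, cp, category))
            pending_first = None
        else:
            ranges.append((cp, cp, category))
    return ranges


def is_assigned(cp: int, ranges: List[Range], private_use_counts: bool = False) -> bool:
    """True if cp is listed with a category other than Cs (and, by default, Co)."""
    excluded = {"Cs"} if private_use_counts else {"Cs", "Co"}
    return any(first <= cp <= last and category not in excluded
               for first, last, category in ranges)
```

With ranges = parse_unicode_data(SAMPLE), is_assigned(0x61, ranges) is true and is_assigned(0xD800, ranges) is false. For real use, the lines would come from the UnicodeData.txt file of the Unicode version of interest. In Python specifically, the standard library function unicodedata.category exposes the same General_Category data for the Unicode version bundled with the interpreter, returning "Cn" for unassigned code points and noncharacters.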