
What is meant by the notation "U+" when discussing Unicode encoding?


I realize this is pretty basic, as I am reading about Unicode in Wikipedia and wherever it points. But this "U+0000" notation is not completely explained. It appears to me that "U" always equals 0.

Why is that "U+" part of the notation? What exactly does it mean? (It appears to be some base value, but I cannot understand when or why it is ever non-zero.)

Also, if I receive a string of text from some other source, how do I know if that string is encoded as UTF-8, UTF-16 or UTF-32? Is there some way I can determine that automatically from context?


Solution

    1. From Wikipedia, article Unicode, section Architecture and Terminology:

      Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used.

      This convention was introduced so that the readers understand that the code point is specifically a Unicode code point. For example, the letter ă (LATIN SMALL LETTER A WITH BREVE) is U+0103; in Code Page 852 it has the code 0xC7, in Code Page 1250 it has the code 0xE3, but when I write U+0103 everybody understands that I mean the Unicode code point and they can look it up.

    2. For languages written with the Latin alphabet, UTF-16 and UTF-32 strings will most likely contain a large number of bytes with the value 0, which do not appear in UTF-8 encoded text (unless the text contains an actual NUL character, U+0000). By looking at which bytes are zero you can also infer the byte order of UTF-16 and UTF-32 strings, even in the absence of a Byte Order Mark.

      So for example if you get the bytes

       0xC3 0x89 0x70 0xC3 0xA9 0x65
      

      this is most likely Épée in UTF-8 encoding. In big-endian UTF-16 this would be

       0x00 0xC9 0x00 0x70 0x00 0xE9 0x00 0x65
      

      (Note how every other byte, starting with the first, is zero.)
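
To make point 1 concrete, here is a minimal Python sketch of the "U+" notation. The helper name `u_plus` is made up for this illustration; `ord` and the `cp852` / `cp1250` codecs are standard Python. It shows that "U+" is just a label in front of the hexadecimal code point, and that the same character has different byte values in legacy code pages.

```python
def u_plus(ch: str) -> str:
    """Format one character as U+ followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"


# BMP characters get four digits, characters outside the BMP get five or six.
print(u_plus("X"))           # U+0058
print(u_plus("\u0103"))      # U+0103  (LATIN SMALL LETTER A WITH BREVE)
print(u_plus("\U0001F600"))  # U+1F600 (outside the BMP, five digits)

# The same character has a different code in each legacy code page,
# which is why the "U+" prefix is useful to say "this is the Unicode number".
print("\u0103".encode("cp852"))   # b'\xc7'
print("\u0103".encode("cp1250"))  # b'\xe3'
```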
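
The zero-byte heuristic from point 2 could be sketched roughly like this in Python. `guess_encoding` is a hypothetical helper written for this answer, not a library function, and it assumes the text is mostly Latin-alphabet characters. It checks for a Byte Order Mark first and only then looks at where the zero bytes fall, so treat the result as an educated guess rather than proof.

```python
import codecs

# UTF-32 BOMs are tested before UTF-16 ones, because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Guess a Unicode encoding from BOMs and zero-byte positions (heuristic)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # No zero bytes at all: UTF-8 is plausible if the bytes decode cleanly.
    if 0 not in data:
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "unknown"

    # Count zero bytes at each offset modulo 4.
    z = [data[i::4].count(0) for i in range(4)]

    if len(data) % 4 == 0:
        units = len(data) // 4
        # UTF-32 of BMP text: the two high bytes of every unit are zero.
        if z[0] == z[1] == units and z[3] < units:
            return "utf-32-be"
        if z[2] == z[3] == units and z[0] < units:
            return "utf-32-le"

    if len(data) % 2 == 0:
        even, odd = z[0] + z[2], z[1] + z[3]
        if even > odd:
            return "utf-16-be"  # zero high byte comes first in each pair
        if odd > even:
            return "utf-16-le"  # zero high byte comes second in each pair

    return "unknown"


print(guess_encoding(bytes([0xC3, 0x89, 0x70, 0xC3, 0xA9, 0x65])))
print(guess_encoding(bytes([0x00, 0xC9, 0x00, 0x70, 0x00, 0xE9, 0x00, 0x65])))
```

With the two byte sequences from the answer, this should report `utf-8` for the first and `utf-16-be` for the second. Text in scripts where neither byte of a UTF-16 code unit is zero (CJK, for example) will defeat the zero-counting part, so a BOM or out-of-band information (an HTTP header, a file format specification) is more reliable when available.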