
Is the statement "Unicode encoding" accurate?


Many tutorials use the term "Unicode encoding", but I feel this term is inaccurate.

As I understand it, Unicode is just a character set, not an encoding. An encoding is a way of storing Unicode characters, such as UTF-8, GBK, or GB2312. These encodings turn characters into byte sequences, and different encoding rules produce different byte sequences for the same characters. So I don't think the phrase "Unicode encoding" is accurate.
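For example, here is a minimal Python sketch of what I mean (the two characters are just an arbitrary pair I picked; the comments describe what Python's standard codecs produce):

```python
# The same two characters, stored under several different encodings.
text = "汉字"

print(text.encode("utf-8"))      # 3 bytes per character: b'\xe6\xb1\x89\xe5\xad\x97'
print(text.encode("gbk"))        # 2 bytes per character, different byte values
print(text.encode("gb2312"))     # same result as GBK for these common characters
print(text.encode("utf-16-le"))  # 2 bytes per character, different values again
```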


Solution

  • You are correct. Unicode is a family of standards: it defines a set of code points, plus multiple encodings that transform a sequence of code points into a sequence of code units (e.g. bytes for UTF-8, 16-bit values for UTF-16, etc.).

    However, historically, Unicode started out as a pure 16-bit encoding. Quoting the original 1988 proposal (via Wikipedia):

    Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.

    The original intent was to encode only characters in common, modern usage, which the spec assumed would be well served by 16 bits. This turned out to be a short-sighted assumption, and Unicode 2.0 expanded the code point range by introducing the "surrogate pair" system into the 16-bit encoding. The original fixed-width encoding (until then just called "Unicode") was retroactively named "UCS-2", and the new 16-bit encoding with surrogates was named "UTF-16" (the sketch at the end of this answer shows a surrogate pair in action).

    Due to these historical factors, several systems and pieces of documentation may still say "Unicode" when they really mean "UTF-16" or "UCS-2". Windows is a common example: I'm fairly sure I've seen a number of places there that say "Unicode" but produce UTF-16 Little Endian.
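    As a rough illustration of the points above (code units vs. bytes, and surrogate pairs), here is a small Python sketch; the specific characters are just examples I picked:

    ```python
    # A BMP character fits in one UTF-16 code unit; a character outside the BMP
    # (code point above U+FFFF) needs a surrogate pair: two 16-bit code units.
    bmp_char = "A"              # U+0041
    astral_char = "\U0001F600"  # U+1F600, outside the original 16-bit range

    for ch in (bmp_char, astral_char):
        utf16_le = ch.encode("utf-16-le")   # the byte layout Windows often labels "Unicode"
        # Rebuild the 16-bit code units from the little-endian byte pairs.
        code_units = [utf16_le[i] | (utf16_le[i + 1] << 8)
                      for i in range(0, len(utf16_le), 2)]
        print(f"U+{ord(ch):04X} -> UTF-16 code units: {[hex(u) for u in code_units]}")

    # Expected output (surrogate pair 0xD83D, 0xDE00 for U+1F600):
    # U+0041 -> UTF-16 code units: ['0x41']
    # U+1F600 -> UTF-16 code units: ['0xd83d', '0xde00']
    ```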