c++ windows unicode utf-16

Wide character Windows


Windows defines the wchar_t type to be 16 bits wide. However, the UTF-16 encoding it uses means that some characters are actually encoded with 4 bytes (32 bits), that is, as two 16-bit code units.

Does this mean that if I'm developing an application for Windows, the following statement:

wchar_t symbol = ... // Whatever

might only represent a part of the actual symbol?


And what will happen if I do the same under *nix, where wchar_t is 32 bits long?


Solution

  • Yes, on Windows symbol may hold only one half of a surrogate pair. On most *nix systems wchar_t is 32 bits wide and can hold any Unicode code point. Note that a Unicode code point does not necessarily represent a character, since some characters are encoded by more than one code point (for example, a base letter followed by a combining mark), so counting wchar_t elements does not reliably count characters on either platform. In particular, this implies that it doesn't make sense to use anything other than UTF-8 encoded narrow-char strings anywhere outside Unicode libraries, even on Windows.

    Read this old thread for details.
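
To make the surrogate-pair point concrete, here is a minimal sketch rather than a definitive implementation. It uses char16_t and std::u16string instead of wchar_t so that it behaves the same on Windows and *nix; that substitution is my assumption, as are the helper names (is_high_surrogate, is_low_surrogate, combine_surrogates, count_code_points), which are purely illustrative. It shows that a code point outside the Basic Multilingual Plane, such as U+1F600, occupies two 16-bit units, so a single 16-bit variable can hold only half of it.

```cpp
#include <cstddef>
#include <string>

// High (leading) surrogates occupy U+D800..U+DBFF.
constexpr bool is_high_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

// Low (trailing) surrogates occupy U+DC00..U+DFFF.
constexpr bool is_low_surrogate(char16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Combine a valid high/low surrogate pair into one code point.
// Hypothetical helper: the caller must check both halves first.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000
         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
         + (static_cast<char32_t>(lo) - 0xDC00);
}

// Count code points (not user-perceived characters) in a UTF-16 string.
// A well-formed surrogate pair counts once; an unpaired surrogate is
// counted as one unit instead of being rejected.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}

// U+1F600 is encoded in UTF-16 as the pair D83D DE00.
static_assert(combine_surrogates(0xD83D, 0xDE00) == 0x1F600,
              "surrogate pair should decode to U+1F600");
```

For example, std::u16string(u"\U0001F600").size() is 2, while count_code_points(u"\U0001F600") is 1. Even this count is in code points: combining sequences still make the number of visible characters smaller than the number of code points, which is the point made in the answer above.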