Windows defines the wchar_t type to be 16 bits wide. However, the UTF-16 encoding it uses means that some characters are actually encoded with 4 bytes (32 bits), i.e. as two 16-bit code units. Does this mean that if I'm developing an application for Windows, the following statement:
wchar_t symbol = ... // Whatever
might represent only a part of the actual character?
And what will happen if I do the same under *nix, where wchar_t is 32 bits long?
Yes, on Windows symbol may hold just one half of a surrogate pair: any code point outside the Basic Multilingual Plane is encoded as two 16-bit code units, and a single wchar_t can hold only one of them. On *nixes wchar_t is 32 bits wide and can hold any Unicode code point.

Note that even a whole code point doesn't necessarily represent a character: some characters are encoded by more than one code point (a letter followed by a combining accent, for example), so counting code units, or even code points, doesn't give you a character count at all. In particular, this implies that it doesn't make sense to use anything other than UTF-8 encoded narrow-char strings anywhere outside Unicode libraries, even on Windows.
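For illustration, here is a minimal, self-contained sketch in standard C++; U+1F600 is just an arbitrary example of a code point outside the Basic Multilingual Plane, not something taken from your code:

#include <cstdio>
#include <string>

int main() {
    // On Windows wchar_t is 16 bits, so this one code point occupies TWO
    // wchar_t units (a surrogate pair); on typical *nix systems wchar_t
    // is 32 bits and the same literal occupies ONE unit.
    std::wstring wide = L"\U0001F600";
    std::printf("wchar_t units: %zu, sizeof(wchar_t): %zu bytes\n",
                wide.size(), sizeof(wchar_t));

    // char16_t is 16 bits everywhere, so the UTF-16 view always needs
    // two code units (a surrogate pair) for this code point.
    std::u16string utf16 = u"\U0001F600";
    std::printf("UTF-16 code units: %zu\n", utf16.size());

    // The same code point in UTF-8; the four bytes are spelled out
    // explicitly to stay independent of the compiler's execution charset.
    std::string utf8 = "\xF0\x9F\x98\x80";
    std::printf("UTF-8 bytes: %zu\n", utf8.size());
}

On Windows this typically prints 2 wchar_t units (sizeof(wchar_t) is 2), while on Linux or macOS it prints 1 (sizeof(wchar_t) is 4); the UTF-16 and UTF-8 counts are 2 and 4 everywhere. None of these numbers is a "character count" in the user-perceived sense.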
Read this old thread for details.