winapi, utf-16

Does the WinApi ever validate UTF-16?


The Windows documentation makes repeated reference to both UNICODE and UTF-16. I know this is a lie for the file system (i.e. it accepts any sequence of wchar_t), and other documentation suggests that the handling of invalid UTF-16 is merely "undefined". So I'm confused. Can I assume non-filesystem APIs will return valid UTF-16? Or should I assume they won't?

Edit: As it's causing some confusion, I'll explain a few terms:


UTF-16

UTF-16 is defined in the Unicode specification (pdf). The FAQ makes clear what is and isn't well formed UTF-16:

Are there any 16-bit values that are invalid?

Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.

What about noncharacters? Are they invalid?

Not at all. Noncharacters are valid in UTFs and must be properly converted. For more details on the definition and use of noncharacters, as well as their correct representation in each UTF, see the Noncharacters FAQ.

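So the only restriction is that surrogates must come in pairs: a leading surrogate (0xD800 to 0xDBFF) must be immediately followed by a trailing surrogate (0xDC00 to 0xDFFF), and a trailing surrogate must be immediately preceded by a leading one. All other wchar_t (16-bit) values should be accepted as is.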
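The rule above can be sketched as a small validator. This is a minimal illustration, not an API Windows provides: the function names are hypothetical, and `char16_t` stands in for the 16-bit Windows `wchar_t` so the snippet also compiles on other platforms.

```cpp
#include <cstddef>
#include <string>

// True for a leading (high) surrogate, 0xD800..0xDBFF.
constexpr bool is_lead_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True for a trailing (low) surrogate, 0xDC00..0xDFFF.
constexpr bool is_trail_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: returns true if every surrogate in `s` belongs to a
// lead+trail pair. Noncharacters such as U+FFFE are deliberately accepted,
// as the Unicode FAQ requires.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_lead_surrogate(s[i])) {
            // A lead must be immediately followed by a trail.
            if (i + 1 >= s.size() || !is_trail_surrogate(s[i + 1])) return false;
            ++i;  // skip the trail that completes this pair
        } else if (is_trail_surrogate(s[i])) {
            // A trail not consumed by a preceding lead is unpaired.
            return false;
        }
    }
    return true;
}
```

A string received from a Windows API could be passed through a check like this before being handed to code that insists on well-formed UTF-16.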


UCS-2

As mentioned in Ben Voigt's answer, this is a now-obsolete encoding that allowed any wchar_t values. Because it doesn't have the same restrictions as UTF-16, some UCS-2 strings (those containing unpaired surrogates) are invalid UTF-16.


Solution

  • Windows wide characters are arbitrary 16-bit numbers (formerly called "UCS-2", before the Unicode Consortium purged that notation). So you cannot assume that a wide string will be a valid UTF-16 sequence. (MultiByteToWideChar is a notable exception that returns only UTF-16.)

    Decoding as UTF-16 only makes sense if the program that generated the string used a UTF-16 convention, but there's no guarantee of that, just as there's no guarantee that 8-bit characters contain UTF-8.
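Since there is no such guarantee, one defensive option is to sanitize wide strings before treating them as UTF-16. The sketch below is one possible approach, not something the answer prescribes: `sanitize_utf16` is a hypothetical helper that replaces each unpaired surrogate with U+FFFD (the replacement character), and `char16_t` again stands in for the 16-bit Windows `wchar_t`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies `in`, replacing every unpaired surrogate with
// U+FFFD so that the result is always well-formed UTF-16.
std::u16string sanitize_utf16(const std::u16string& in) {
    const char16_t replacement = char16_t(0xFFFD);
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        const bool lead = u >= 0xD800 && u <= 0xDBFF;
        const bool trail = u >= 0xDC00 && u <= 0xDFFF;
        const bool next_is_trail =
            i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (lead && next_is_trail) {
            // Valid surrogate pair: copy both code units unchanged.
            out.push_back(u);
            out.push_back(in[++i]);
        } else if (lead || trail) {
            // Unpaired surrogate: substitute the replacement character.
            out.push_back(replacement);
        } else {
            // Ordinary BMP code unit (noncharacters included): keep as is.
            out.push_back(u);
        }
    }
    return out;
}
```

Note that this conversion is lossy: two distinct ill-formed inputs can map to the same output, so it is unsuitable for file names, which have to round-trip exactly.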