c++ · winapi · unicode · encoding · character-encoding

Which encoding works best for Windows API calls?


I use a function from the Windows API called GetFileAttributesW to retrieve attributes from a file. The function signature is defined as:

DWORD GetFileAttributesW([in] LPCWSTR lpFileName);

LPCWSTR is defined as const wchar_t*.

I want to call the function:

fs::path inputPath = ...
GetFileAttributesW(inputPath.whichMethodHere()?);

The input path is of type std::filesystem::path, which has several convenience converters such as:

std::filesystem::path::string()
std::filesystem::path::wstring()
std::filesystem::path::u8string()
std::filesystem::path::u16string()
std::filesystem::path::u32string()

The two functions wstring() and u16string() stand out to me. They return strings whose character types are wchar_t and char16_t, respectively.

Question 1:

What is the main difference between wstring() and u16string()? Does wstring() return something different from u16string() in real-life scenarios?

Question 2:

Which encoding does the Windows API generally expect?


Solution

  • TL;DR: char16_t doesn't provide any substantial advantage over wchar_t on Windows, and it's less convenient to use. Choose wchar_t, always. Microsoft's implementation of path::native() returns a std::wstring (not a std::u16string) as a conscious design decision, so the path can be handed to the API directly (see the sketch below).
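
    A minimal sketch of the call this implies (the helper name attributesOf is made up; what matters is that on Windows path::value_type is wchar_t, so c_str() already yields the const wchar_t* the API expects):

    #include <filesystem>

    #include <Windows.h>

    namespace fs = std::filesystem;

    // Hypothetical helper: query a file's attributes.
    DWORD attributesOf(const fs::path& inputPath)
    {
        // On Windows, path stores its name natively as wchar_t, so c_str()
        // (equivalent to native().c_str()) is already a const wchar_t*.
        return ::GetFileAttributesW(inputPath.c_str());
    }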


    Which encoding does the Windows API generally expect?

    Windows uses UTF-16LE encoding internally, everywhere, but doesn't enforce well-formedness anywhere. Particularly when dealing with filesystem objects, any sequence of 16-bit values (with a few exceptions) is admissible.
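
    As an illustration of that lack of enforcement, the following sketch passes an ill-formed UTF-16 string (a lone high surrogate) to the API; the path itself is made up. The call is not rejected for being ill-formed, the lookup simply uses that exact sequence of 16-bit values:

    #include <Windows.h>

    int main()
    {
        // \xD800 is an unpaired high surrogate: ill-formed UTF-16, but a
        // perfectly admissible code unit as far as the filesystem API goes.
        DWORD attrs = ::GetFileAttributesW(L"C:\\temp\\\xD800.txt");
        return attrs == INVALID_FILE_ATTRIBUTES ? 1 : 0;
    }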

    What is the main difference between wstring() and u16string()? Does wstring() return something different from u16string() in real-life scenarios?

    The answer is more involved than it might seem at first. wstring() and u16string() return a std::wstring and a std::u16string, respectively, which are std::basic_string instantiations for wchar_t and char16_t.
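
    For reference, those return types are just the standard std::basic_string aliases; the only difference is the character type:

    #include <string>
    #include <type_traits>

    static_assert(std::is_same_v<std::wstring,   std::basic_string<wchar_t>>);
    static_assert(std::is_same_v<std::u16string, std::basic_string<char16_t>>);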

    The question thus boils down to this: What is the difference between wchar_t and char16_t? The answer is disturbingly unspecific. The section on character types lists them as follows:

    • wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point. It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.

    • char16_t - type for UTF-16 character representation, required to be large enough to represent any UTF-16 code unit (16 bits). It has the same size, signedness, and alignment as std::uint_least16_t, but is a distinct type.

    If you read carefully, neither type has a fixed size, and neither type is required to use any particular character encoding. In particular, C++ makes no guarantee that char16_t stores UTF-16 encoded text, only that the type must be able to.
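
    A compile-time sketch of how little is guaranteed: the first assertion below holds when targeting Windows (both types are 16 bits wide) but fails on, say, a typical Linux toolchain, where wchar_t is 32 bits; only the second is guaranteed by the standard:

    #include <climits>

    // Holds on Windows; not portable.
    static_assert(sizeof(wchar_t) == sizeof(char16_t),
                  "wchar_t and char16_t have the same width on this platform");

    // Guaranteed: char16_t can hold any UTF-16 code unit.
    static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
                  "char16_t is at least 16 bits wide");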

    What does this mean in practice? Since Windows uses UTF-16LE everywhere, wchar_t has historically always been a 16-bit type on Windows. The introduction of char16_t hasn't changed much: it's just another type capable of representing UTF-16 encoded text. But since it is a distinct type, you're going to have to cast whenever you pass a char16_t* into a Windows API function (which has always accepted a wchar_t*, or rather one of its C approximations).
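
    A sketch of what that cast looks like if you insist on char16_t strings (the helper name is made up; the reinterpret_cast works in practice because both types are 16 bits wide on Windows, but it is a cast you would not need with std::wstring):

    #include <filesystem>
    #include <string>

    #include <Windows.h>

    namespace fs = std::filesystem;

    DWORD attributesViaU16(const fs::path& inputPath)
    {
        std::u16string name = inputPath.u16string();
        // char16_t* is a distinct type and never converts to wchar_t*
        // implicitly, so an explicit cast is required at the API boundary.
        return ::GetFileAttributesW(reinterpret_cast<const wchar_t*>(name.c_str()));
    }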

    On Windows, you can pretty much assume that wchar_t and char16_t will be the same width and hold UTF-16 encoded characters. You can thus also assume that wstring() and u16string() will return the same binary data. wchar_t is generally more useful, as it is the native type used in the interfaces and doesn't require casts.
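
    To convince yourself that the two accessors produce identical bytes on Windows, a quick check along these lines (the example path is arbitrary) can compare them directly:

    #include <cassert>
    #include <cstring>
    #include <filesystem>
    #include <string>

    int main()
    {
        std::filesystem::path p{L"C:\\Windows\\notepad.exe"};

        std::wstring   w   = p.wstring();
        std::u16string u16 = p.u16string();

        // Same number of code units and the same 16-bit values.
        assert(w.size() == u16.size());
        assert(std::memcmp(w.data(), u16.data(), w.size() * sizeof(wchar_t)) == 0);
        return 0;
    }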