c++, c++17, language-lawyer, undefined-behavior, strict-aliasing

Is casting strings from `wchar_t` to `char16_t` legal if the encoding and width are the same?


On Windows, wchar_t holds UTF-16 (LE) code units, which makes it, for the most part, equivalent to char16_t. However, the two are still distinct types in the C++ type system, which makes me uncertain whether converting between sequences of these two character types is legal as per the C++ standard.

My question is this: In C++17, is it legal to perform the following casts, and to read from the converted pointers:

  • reinterpret_cast<const wchar_t*>(char16_ptr) where decltype(char16_ptr) is const char16_t*, and
  • reinterpret_cast<const char16_t*>(wchar_ptr) where decltype(wchar_ptr) is const wchar_t*

For the purposes of this question, assume the following:

  • sizeof(wchar_t) == sizeof(char16_t), and
  • wchar_t is formatted the same as char16_t (as is the case on Windows)

Basically, is this a strict-aliasing violation?

My understanding is that the cast itself is valid thanks to [expr.reinterpret.cast]/7, but that the result of the cast cannot safely be used, since the object is being accessed through a type that is neither its own type nor char, unsigned char, or std::byte. Is this interpretation correct?


Note: Other questions have been asked regarding wchar_t and char16_t being the same, but this question is not a duplicate of those as far as I can tell. Notably, the question "Are wchar_t and char16_t the same on Windows?" actually performs a reinterpret_cast between pointers, but none of the answers actually address whether this cast was ever legal in the first place.


Solution

  • You already know the answer to this: strictly speaking, no.

    wchar_t is not char16_t. Neither derives from the other. Neither is similar to the other. Neither is a signed/unsigned version of the other. Neither is an aggregate containing the other. And neither of them is a bytewise type (char, unsigned char, or std::byte).

    So you cannot access a wchar_t through a pointer/reference to a char16_t.

    If strictly avoiding aliasing violations is your goal, you're going to have to copy the data into a different object of the target type (for example with std::memcpy). That is valid, assuming both types have the same size and representation.
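
The copy described above can be sketched roughly as follows. This is a minimal illustration, not code from the original answer: the helper name copy_code_units and the choice of std::vector as the destination are assumptions made for the example. It relies on both types being trivially copyable and the same size, and it leaves the "same encoding" assumption from the question to the caller, since that cannot be checked at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper (name and signature are illustrative only).
// Copies `count` code units of type From into newly created objects of type To.
// The result holds real To objects, so reading them never accesses a From
// object through a To glvalue.
template <class To, class From>
std::vector<To> copy_code_units(const From* src, std::size_t count) {
    static_assert(sizeof(To) == sizeof(From),
                  "source and destination code units must have the same size");
    static_assert(std::is_trivially_copyable_v<To> &&
                      std::is_trivially_copyable_v<From>,
                  "memcpy is only defined for trivially copyable types");
    assert(src != nullptr || count == 0);

    std::vector<To> out(count);  // value-initialized To objects
    if (count != 0) {
        // Bytewise copy of the object representations.
        std::memcpy(out.data(), src, count * sizeof(From));
    }
    return out;
}
```

On Windows this could be called as copy_code_units<char16_t>(ws.data(), ws.size()) for a std::wstring ws. On platforms where wchar_t is wider than char16_t, the first static_assert rejects that call at compile time instead of silently producing garbage.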