Escape sequences for char8_t and unsigned char


While trying to use escape sequences to construct a char8_t string (so as not to rely on the file or compiler encoding), I ran into an issue with MSVC.

I wonder whether it is a bug or implementation-dependent behavior.
Is there a workaround?

#include <algorithm> // std::equal
#include <iterator>  // std::begin, std::end

constexpr char8_t s1[] =     u8"\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
constexpr unsigned char s2[] = "\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
//constexpr char8_t s3[] = u8"コ ン ニ チ ハ";

static_assert(std::equal(std::begin(s1), std::end(s1),
                         std::begin(s2), std::end(s2))); // Fails on MSVC

Note: The final goal is to replace std::filesystem::u8path(s2) (std::filesystem::u8path is deprecated since C++20) with std::filesystem::path(s1).


Solution

  • This is a bug in MSVC that I expect to be fixed at some point during Microsoft's implementation of C++23.

    Historically, numeric escape sequences in character and string literals were not well specified in the C++ standard, and this led to a number of core issues. These issues were addressed by P2029, a paper adopted for C++23 in November of 2020. The reported MSVC bug (along with an additional one related to non-encodable characters) is discussed in the "Implementation impact" section of the paper.

    As mentioned by other commenters, use of universal-character-names (UCNs) like \u1234 would be the preferred solution to avoid a dependency on source file encoding.