In my application I need to parse the string literals supported by the C++14 standard. I'm reading this article on the subject, and I'm trying to understand how to convert a universal character name to a sequence of wchar_ts.
Let me explain with this example. Say, if I compile the following with VS 2017:
const wchar_t* str2 = L"\U0001F609 is ;-)";
str2 becomes the following sequence of bytes in memory:
So how did \U0001F609 become 3d d8 09 de? Or, what WinAPI do I need to make this conversion?
how did \U0001F609 become 3d d8 09 de?
wchar_t is 16-bit on Windows, but 0x1F609 > UINT16_MAX, so a so-called surrogate pair is used to encode the code point as two 16-bit code units.
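From Wikipedia:
- 0x10000 is subtracted from the code point, leaving a 20-bit number (0x1F609 - 0x10000 = 0xF609)
- The high ten bits are added to 0xD800 to give the first code unit, the high surrogate (0xD800 + 0x3D)
- The low ten bits are added to 0xDC00 to give the second code unit, the low surrogate (0xDC00 + 0x209)
Which leaves 0xD83D 0xDE09. Encoding this as two little-endian 16-bit code units gives 3D D8 09 DE.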
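The arithmetic above can be sketched in a few lines of C++. This is only an illustration: to_surrogate_pair and to_little_endian_bytes are hypothetical names of my own, and the first function assumes its input is already in the range 0x10000 to 0x10FFFF.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: split a supplementary code point (0x10000..0x10FFFF)
// into a UTF-16 surrogate pair. No range checking is done here.
std::array<std::uint16_t, 2> to_surrogate_pair(std::uint32_t code_point) {
    const std::uint32_t v = code_point - 0x10000;  // 20-bit value, e.g. 0x1F609 -> 0xF609
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // high ten bits
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // low ten bits
    return {high, low};
}

// Hypothetical helper: lay the two code units out as little-endian bytes
// (low byte of each unit first), which is how they appear in memory on x86.
std::array<std::uint8_t, 4> to_little_endian_bytes(const std::array<std::uint16_t, 2>& units) {
    return {
        static_cast<std::uint8_t>(units[0] & 0xFF), static_cast<std::uint8_t>(units[0] >> 8),
        static_cast<std::uint8_t>(units[1] & 0xFF), static_cast<std::uint8_t>(units[1] >> 8),
    };
}
```

Calling to_surrogate_pair(0x1F609) gives 0xD83D and 0xDE09, and to_little_endian_bytes on that pair gives 3D D8 09 DE.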
how do I convert from a universal character name to a sequence of wchar_ts?
A universal character name designates a Unicode code point. On Windows, wchar_t strings are encoded as UTF-16.
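Since you are parsing literals yourself, the first step is getting from the escape text to the code point. Here is a minimal sketch, assuming the input holds the escape as plain characters (a backslash, then u plus 4 hex digits or U plus 8 hex digits). parse_ucn is a hypothetical name, and the sketch does not enforce the standard's extra restrictions on which values a universal character name may designate (surrogate code points, for instance, are not allowed).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: parse a universal character name ("\uXXXX" or
// "\UXXXXXXXX") found at the start of `text`.
// On success, stores the value in `code_point` and returns true.
bool parse_ucn(const std::string& text, char32_t& code_point) {
    if (text.size() < 2 || text[0] != '\\') return false;

    std::size_t digits = 0;
    if (text[1] == 'u') digits = 4;        // \uXXXX
    else if (text[1] == 'U') digits = 8;   // \UXXXXXXXX
    else return false;
    if (text.size() < 2 + digits) return false;

    char32_t value = 0;
    for (std::size_t i = 0; i < digits; ++i) {
        const char c = text[2 + i];
        unsigned nibble = 0;
        if (c >= '0' && c <= '9') nibble = c - '0';
        else if (c >= 'a' && c <= 'f') nibble = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') nibble = c - 'A' + 10;
        else return false;                 // not a hex digit
        value = (value << 4) | nibble;     // accumulate one hex digit at a time
    }
    code_point = value;
    return true;
}
```

For the text \U0001F609 this yields the code point 0x1F609, which then still has to be encoded as UTF-16 code units.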
What WinAPI do I need to make this conversion?
I don't know if there are any APIs specifically for that, but it's quite easy to write your own UTF-32* to UTF-16 converter. Check the Wikipedia page for more information.
*: Because 32 bits are enough to hold any Unicode code point, every code point can be encoded in a single UTF-32 code unit.
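To give an idea of how small such a converter can be, here is a rough sketch that works on whole strings. utf32_to_utf16 is a hypothetical name, and throwing on invalid input is just one possible design choice (substituting U+FFFD would be another). It returns std::u16string rather than std::wstring so that it does not depend on wchar_t being 16-bit; on Windows the code units can be copied into wchar_t one for one.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical converter: UTF-32 code points in, UTF-16 code units out.
std::u16string utf32_to_utf16(const std::u32string& input) {
    std::u16string output;
    for (char32_t cp : input) {
        // Values above 0x10FFFF and surrogate code points are not valid input.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a valid Unicode code point");
        }
        if (cp < 0x10000) {
            // Fits in a single 16-bit code unit.
            output.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair: high ten bits, then low ten bits.
            const char32_t v = cp - 0x10000;
            output.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            output.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return output;
}
```

With this, utf32_to_utf16(U"\U0001F609 is ;-)") starts with the code units 0xD83D 0xDE09, followed by one code unit for each of the remaining characters.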