Tags: c++, c++11, winapi, unicode, utf-16

How to convert from "universal character name" to a sequence of wchar_t's?


In my application I need to parse the string literals supported by the C++14 standard. I'm reading this article on the subject and trying to understand: how do I convert a universal character name to a sequence of wchar_ts?

Let me explain with this example. Say, if I compile the following with VS 2017:

const wchar_t* str2 = L"\U0001F609 is ;-)";

str2 becomes the following sequence of bytes in memory:

[Image: memory view of str2; the first four bytes are 3d d8 09 de]

So how did \U0001F609 become 3d d8 09 de? Or, what WinAPI do I need to make this conversion?


Solution

  • how did \U0001F609 become 3d d8 09 de?

    wchar_t is 16 bits wide on Windows, and 0x1F609 is greater than UINT16_MAX (0xFFFF), so the code point does not fit in a single code unit. UTF-16 therefore encodes it as a surrogate pair: two code units of 16 bits each.

    From Wikipedia:

    • 0x10000 is subtracted from the code point, leaving a 20-bit number in the range 0x00000–0xFFFFF. (0x1F609 - 0x10000 = 0xF609)
    • The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800–0xDBFF. (0xF609 >> 10 = 0x3D, and 0xD800 + 0x3D = 0xD83D)
    • The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00–0xDFFF. (0xF609 & 0x3FF = 0x209, and 0xDC00 + 0x209 = 0xDE09)

    This gives the code units 0xD83D 0xDE09. Stored as two little-endian 16-bit code units, they become the bytes 3D D8 09 DE.
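
The three steps above can be sketched in a few lines of C++. This is only an illustration, not a WinAPI call: to_surrogate_pair is a hypothetical helper name, and it assumes the caller passes a code point in the range 0x10000–0x10FFFF.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: splits a supplementary code point
// (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
std::array<char16_t, 2> to_surrogate_pair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // precondition assumed by this sketch
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;   // 20-bit value
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // high ten bits
    const char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low ten bits
    return {high, low};
}
```

Calling to_surrogate_pair(0x1F609) yields the code units 0xD83D and 0xDE09, matching the arithmetic above.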

  • How do I convert from a universal character name to a sequence of wchar_ts?

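    A universal character name denotes a Unicode code point, and on Windows wchar_t strings are UTF-16. Converting one to wchar_ts therefore means encoding that code point as UTF-16: a single code unit for code points below 0x10000, and a surrogate pair (as described above) otherwise.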
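
Before the encoding step, a parser first has to recover the code point from the escape text. A minimal sketch, assuming the input is the raw source text of the literal and starts at the backslash; parse_ucn is a hypothetical name, and it rejects surrogate values and values above 0x10FFFF because those are not valid code points for a universal character name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reads \uXXXX or \UXXXXXXXX at the start of `s`.
// On success stores the code point in `out` and returns true.
bool parse_ucn(const std::string& s, char32_t& out) {
    if (s.size() < 2 || s[0] != '\\') return false;
    std::size_t digits = 0;
    if (s[1] == 'u') digits = 4;         // short form: four hex digits
    else if (s[1] == 'U') digits = 8;    // long form: eight hex digits
    else return false;
    if (s.size() < 2 + digits) return false;

    char32_t cp = 0;
    for (std::size_t i = 0; i < digits; ++i) {
        const char c = s[2 + i];
        unsigned d;
        if (c >= '0' && c <= '9') d = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') d = static_cast<unsigned>(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') d = static_cast<unsigned>(c - 'A') + 10;
        else return false;               // not a hexadecimal digit
        cp = cp * 16 + d;
    }
    // Reject surrogates and values beyond the Unicode range.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    out = cp;
    return true;
}
```

The resulting code point can then be encoded as UTF-16 as described above.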

  • What WinAPI do I need to make this conversion?

    I don't know whether there is an API specifically for that, but it's quite easy to write your own UTF-32* to UTF-16 converter. Check the Wikipedia page for more information.


    *: Because 32 bits are enough to hold any Unicode code point, every code point can be encoded in a single UTF-32 code unit.
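
A hand-written converter along those lines might look like the sketch below. It is not a WinAPI function: utf32_to_utf16 is a hypothetical name, and throwing on invalid input is just one possible design choice.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a UTF-32 string to UTF-16.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        // Surrogates and values above 0x10FFFF cannot be encoded.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a valid Unicode code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // fits in one code unit
        } else {
            const char32_t v = cp - 0x10000;           // 20-bit value
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units can be copied one by one into a std::wstring.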