Converting an ANSI C-String to UNICODE


Note: I am trying to write my own function that performs this conversion

I understand that a char is 1 byte, while a wchar_t is 2 bytes.

So this is how a conversion would happen:

1) Input a text

Hello, world!

2) Get the bytes of the string

48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21

3) Allocate memory twice the number of bytes

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

4) Copy each ANSI byte into every other byte of the new buffer, leaving the byte after it as zero

48 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00

I have a couple of questions about this process:

1) Can I simply cast an ANSI string to UNICODE and have it replicate the exact process above, or will it simply fill the first half of the bytes with the ANSI bytes and leave the rest to 0?

char a[] = { "Hello, world!" };
wchar_t* b = reinterpret_cast<wchar_t*>(a);

2) Looking at the MultiByteToWideChar function, I see a CodePage argument and I wonder what it is. Isn't the conversion all the same (as I understand it and wrote it out above)? I thought the ASCII character codes were all the same everywhere, but this argument seems to say otherwise if I am understanding correctly from the fact it has values for Mac and Windows there.


Solution

  • I thought the ASCII character codes were all the same everywhere, but this argument seems to say otherwise if I am understanding correctly from the fact it has values for Mac and Windows there.

    The ASCII codes are, yes, but bytes with the high bit set (values 128 to 255) in an "Extended ASCII" string (spoiler: there's no such thing) mean different things in each of a large number of codepages, all different encodings intended for use mostly in different geographic locales. The approach you've taken is fine for the simple, plain ASCII case, but it doesn't work in general, and MultiByteToWideChar knows this. It will re-encode properly from whatever codepage you tell it the input uses to what Windows confusingly calls "Unicode" (not "UNICODE"), which is more specifically the UTF-16 encoding.
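To make the "fine for plain ASCII, wrong in general" point concrete, here is a minimal sketch of the byte-widening approach from the question. The name widen_ascii is hypothetical, and the choice to refuse any byte above 0x7F is an assumption made for illustration: such a byte needs a codepage to be interpreted (for example, 0xE9 is U+00E9 "é" in Windows-1252 but U+0439 "й" in Windows-1251), so zero-extending it could produce the wrong character.

```cpp
#include <string>

// Hypothetical helper (not a Windows API): widen `narrow` into `wide`
// by zero-extending each byte to a 16-bit unit, as in steps 1-4 of the question.
// Returns false, leaving `wide` empty, if any byte has the high bit set,
// because the meaning of that byte depends on a codepage we do not know.
bool widen_ascii(const std::string& narrow, std::u16string& wide)
{
    wide.clear();
    wide.reserve(narrow.size());
    for (char c : narrow) {
        const unsigned char byte = static_cast<unsigned char>(c);
        if (byte > 0x7F) {
            wide.clear();
            return false; // not plain ASCII: needs a real codepage conversion
        }
        wide.push_back(static_cast<char16_t>(byte));
    }
    return true;
}
```

Called with "Hello, world!" this yields the 13 UTF-16 units from step 4; called with a string containing the byte 0xE9 it reports failure instead of guessing.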

    Can I simply cast an ANSI string to UNICODE and have it replicate the exact process above, or will it simply fill the first half of the bytes with the ANSI bytes and leave the rest to 0?

    No. A cast does not re-encode things or change values. There you are just saying "I promise that a is a bunch of wchar_ts, even though it has type char*" (it doesn't; it has array type, but that's close enough for today).

    That code actually has undefined behaviour if you use b, because you've broken the aliasing rules (you can examine a T through a char*, but you can't treat a char[] as some T that you never created). But, if it didn't, you'd find that your "string" was now half the length, and more than likely an invalid UTF-16 sequence that would not render correctly anywhere.
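The "half the length" effect can be observed without the undefined behaviour by copying the bytes into 16-bit units with std::memcpy. This is only a sketch: the function name is hypothetical, and it uses char16_t rather than wchar_t so that the unit size is 2 bytes on mainstream platforms, not just on Windows.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: reinterpret raw bytes as 16-bit units by copying them,
// which is what the cast in the question pretends has already happened.
// Any odd trailing byte is dropped. No value is converted or re-encoded.
std::vector<char16_t> view_bytes_as_utf16_units(const char* bytes, std::size_t byte_count)
{
    std::vector<char16_t> units(byte_count / sizeof(char16_t));
    if (!units.empty()) {
        std::memcpy(units.data(), bytes, units.size() * sizeof(char16_t));
    }
    return units;
}
```

For the 14 bytes of "Hello, world!" (13 characters plus the terminator) this gives 7 units. On a little-endian machine the first unit is 0x6548, built from the bytes of 'H' and 'e', rather than 0x0048.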

    So if I wanted to support UTF-32, I would have to create my own wrapper for strings since wchar_t is only 2 bytes long and I need 4 bytes, and also I would not be able to print it with printf for example, correct?

    Technically, sort of yes (though you'd use a library like libicu rather than rolling your own).

    But, in reality, you don't want to use UTF-32. Working with the Windows API you're stuck with UTF-16, but other than that we generally prefer UTF-8 stored in plain chars, which is nice and portable and flexible and good and nice. (You will again want a library for this though.)
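As a rough illustration of why UTF-8 fits in ordinary chars, here is a sketch of encoding a single code point. The name encode_utf8 is hypothetical and the function is for illustration only; as noted above, a library is the better choice in real code. Returning an empty string for invalid input is an assumption made to keep the example short.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as 1 to 4 UTF-8 bytes.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for values
// above U+10FFFF, which are not valid code points to encode.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // Plain ASCII is a single byte, unchanged.
        out.push_back(static_cast<char>(cp));
    } else if (cp <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out; // surrogate: invalid on its own
        }
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

ASCII text comes out byte-for-byte identical, so existing char-based code keeps working on it, while everything else takes two to four bytes.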

    It'd then be up to you as to where you perform the relevant conversions, and/or whether you have a switch that flips from UTF-8 to UTF-16 depending on the platform (like Windows's old UNICODE macro) or just run UTF-8 everywhere until you hit a Windows API boundary.

    Or, if all your input is ASCII as you imply, then you don't really need to do anything other than what you are already doing: either keep your ASCII throughout the program but convert it to UTF-16 when using the Windows API, or use UTF-16 (and wchar_ts) throughout your whole program and have no conversions. Make sure to use the wide-char versions of your favourite functions (like wprintf) if you go down that route, though.