Tags: c++, c, unicode, ansi

wchar_t to unsigned char conversion


I have code that does the following:

unsigned char charStr; // this var can only take the value 0, 1, or 2
WCHAR wcharStr;
...
charStr = wcharStr - '0';
...

I am aware that you might lose data (going from 16 bits to 8 bits) when converting from Unicode (wchar_t) to ANSI (unsigned char). However, can someone explain why subtracting '0' makes this conversion right?


Solution

  • The C and C++ language standards require that the encodings of the digits 0 through 9 be consecutive. Therefore, subtracting '4' - '0', for example, gets you 4 (the first sketch at the end of this answer shows the arithmetic).

    This is not actually required for wchar_t, but in the real world your compiler will map it to Unicode: UTF-16 on Windows, UCS-4 (UTF-32) elsewhere. The first 128 code points of Unicode are the same as ASCII. You're not compiling this code on the one modern, real-world platform that uses a non-ASCII character set (IBM's Z-series mainframes, which default to Code Page 1047 for backward compatibility), so your compiler promotes both the wchar_t and the char to some integral type, probably 32 bits wide, subtracts, and gets the digit's numeric value. It then stores that in a variable of type unsigned char, which is a mistake: the stored value is not a printable digit but the ASCII code of an unprintable control character.

    This code is not correct. If you want to convert from wchar_t to char, use either std::codecvt from the standard library or wcrtomb() from the C standard library. There is also wctob(), which converts to a single byte if and only if that is possible. Set your locale before you use any of them (see the locale-based sketch at the end of this answer).

    If you’re sure that your wchar_t holds Unicode, that your unsigned char holds Latin-1, and that your values are within range, however, you can simply cast the wchar_t value to (unsigned char). Another approach, if you know you have a digit, is to write (wcharStr - L'0') + '0'. Both shortcuts are sketched at the end of this answer.
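
To make the arithmetic concrete, here is a minimal sketch (my own addition, reusing only the variable names from the question) of what actually lands in the unsigned char on an ASCII/Unicode platform:

#include <cstdio>

int main() {
    wchar_t wcharStr = L'2';                  // the digit character '2', code point 0x32
    unsigned char charStr = wcharStr - L'0';  // stores the number 2, not the character '2'

    std::printf("numeric value: %d\n", charStr);      // prints 2
    std::printf("stored byte:   0x%02X\n", charStr);  // 0x02, an unprintable control character
}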
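
The locale-aware conversions mentioned above could look roughly like this. This is only a sketch, assuming the environment's default locale can represent the character; real code should check the return values, as done here:

#include <climits>
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cwchar>

int main() {
    std::setlocale(LC_ALL, "");  // pick up the environment's locale first

    wchar_t wc = L'2';

    // wctob() succeeds only if wc maps to a single byte in the current locale.
    int b = std::wctob(wc);
    if (b != EOF)
        std::printf("wctob:   0x%02X\n", b);

    // wcrtomb() is the general conversion; it may write up to MB_CUR_MAX bytes.
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    std::size_t n = std::wcrtomb(buf, wc, &state);
    if (n != static_cast<std::size_t>(-1))
        std::printf("wcrtomb: %zu byte(s), first is 0x%02X\n", n,
                    static_cast<unsigned char>(buf[0]));
}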
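
And the two shortcuts from the last paragraph, again as a sketch under the stated assumptions (Unicode wchar_t, Latin-1 unsigned char, values in range):

#include <cstdio>

int main() {
    wchar_t wcharStr = L'2';

    // Plain cast: fine only when the code point fits in 0..0xFF (the Latin-1 range).
    unsigned char asLatin1 = static_cast<unsigned char>(wcharStr);

    // Digit round-trip: subtract L'0' to get the value, add '0' back to get
    // the narrow character '2' rather than the number 2.
    unsigned char asDigitChar = (wcharStr - L'0') + '0';

    std::printf("cast: '%c'  round-trip: '%c'\n", asLatin1, asDigitChar);
}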