I have code that implements the following:
unsigned char charStr; // this var can only take the value 0, 1, or 2
WCHAR wcharStr;
...
charStr = wcharStr - '0';
...
I am aware of the fact that you might lose some data (going from 16 bits to 8 bits) when converting from Unicode (wchar_t) to ANSI (unsigned char). However, can someone explain why subtracting '0' makes this conversion right?
The C and C++ language standards require that the encodings of the digits 0 through 9 be consecutive. Therefore, subtracting '4' - '0', for example, will get you 4.
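
As a quick standalone check of that guarantee:

#include <cassert>

int main() {
    // '0' through '9' are guaranteed to have consecutive encodings,
    // so subtracting '0' always yields the digit's numeric value.
    assert('4' - '0' == 4);
    assert('9' - '0' == 9);
}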
This is not actually required for wchar_t, but in the real world your compiler will map it to Unicode, either UTF-16 on Windows or UCS-4 elsewhere. The first 128 code points of Unicode are the same as ASCII. You’re not compiling this code on the one modern, real-world compiler that uses a non-ASCII character set (IBM’s Z-series mainframes, which default to Code Page 1047 for backward compatibility), so your compiler converts your wchar_t and char to some integral type, probably 32 bits wide, subtracts, and gets a digit value. It then stores that in a variable of type unsigned char, which is a mistake: the stored value (0, 1, or 2) is the ASCII code of an unprintable control character, not the character '0', '1', or '2'.
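
Here is a minimal sketch of what actually ends up in charStr, using the variable names from your question:

#include <cstdio>

int main() {
    wchar_t wcharStr = L'2';
    unsigned char charStr = wcharStr - '0';  // both operands promoted, result narrowed back down

    std::printf("%d\n", charStr);  // prints 2: the digit's numeric value, not the character '2'
    // Interpreted as a character, 2 is STX, an unprintable ASCII control code.
}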
This code is not correct. If you want to convert from wchar_t to char, you should use either codecvt from the STL or wcrtomb() from the C standard library. There is also a wctob() that converts to a single byte if and only if that’s possible. Set your locale before you use them.
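
A rough sketch of the two C library functions, assuming the wide character is representable in the environment’s locale:

#include <climits>
#include <clocale>
#include <cstdio>
#include <cwchar>

int main() {
    // Set the locale first so the wide-to-narrow mapping is defined.
    std::setlocale(LC_ALL, "");

    wchar_t wc = L'2';

    // wctob() converts to a single byte, or returns EOF if that is not possible.
    int b = std::wctob(wc);
    if (b != EOF)
        std::printf("wctob: %c\n", static_cast<char>(b));

    // wcrtomb() performs the general conversion and may emit a multi-byte sequence.
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    std::size_t len = std::wcrtomb(buf, wc, &state);
    if (len != static_cast<std::size_t>(-1))
        std::printf("wcrtomb: %.*s\n", static_cast<int>(len), buf);
}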
If you’re sure that your wchar_t holds Unicode, that your unsigned char holds Latin-1, and that your values are within range, however, you can simply cast the wchar_t value to (unsigned char). Another approach, if you know you have a digit, is to write (wcharStr - L'0') + '0'.
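
A short sketch of both shortcuts, assuming the value really is in range:

#include <cstdio>

int main() {
    wchar_t wc = L'2';

    // Valid if wc is known to lie in the Latin-1 range (U+0000..U+00FF).
    unsigned char asLatin1 = (unsigned char)wc;

    // Valid if wc is known to be a decimal digit: re-base from wide '0' to narrow '0'.
    unsigned char asDigit = (wc - L'0') + '0';

    std::printf("%c %c\n", asLatin1, asDigit);  // both print 2
}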