
Converting non-ASCII characters to int in C: the extra bits are filled with 1s rather than 0s


While coding in C, I accidentally noticed that when a non-ASCII character is converted from char (1 byte) to int (4 bytes), the extra bits (3 bytes) are filled with 1s rather than 0s. (For ASCII characters, the extra bits are filled with 0s.) For example:

char c[] = "ā";
int i = c[0];
printf("%x\n", i);

And the result is ffffffc4 rather than c4 itself. (The UTF-8 encoding of ā is \xc4\x81.)

A related issue: when performing a right shift >> on a non-ASCII character, the extra bits on the left end are also filled with 1s rather than 0s, even when the char value is explicitly converted to unsigned int (for signed int, the extra bits are filled with 1s on my system). For example:

char c[] = "ā";
unsigned int u_c;
int i = c[0];
unsigned int u_i = c[0];

u_c = (unsigned int)c[0] >> 1;
c[0] = (unsigned int)c[0] >> 1;
i = i >> 1;
u_i = u_i >> 1;
printf("c=%x\n", (unsigned int)c[0]); // result: ffffffe2. The same with the signed int i.
printf("u_c=%x\n", u_c); // result: 7fffffe2.
printf("i=%x\n", i); // result: ffffffe2.
printf("u_i=%x\n", u_i); // result: 7fffffe2. 

Now I am confused by these results... Do they depend on the representations of char, int and unsigned int, on my operating system (Ubuntu 14.04), or on the ANSI C requirements? I have compiled this program with both gcc (4.8.4) and clang (3.4), and the results are the same.

Thank you so much!


Solution

  • It is implementation-defined whether char is signed or unsigned. On x86, char is customarily a signed integer type; on ARM, it is customarily an unsigned integer type.

    A signed integer is sign-extended when converted to a larger signed type;

    a signed integer converted to an unsigned integer is wrapped into the range of the unsigned type using modulo arithmetic, as if by repeatedly adding or subtracting the maximum value of the unsigned type + 1.


    The solution is to use/cast to unsigned char if you want the value to be portably zero-extended, or for storing small integers in range 0..255.

    Likewise, if you want to store small signed integers, use signed char (its range is at least -127..127, and -128..127 on two's-complement platforms).

    Use char if the signedness doesn't matter - the implementation will probably have chosen the type that is the most efficient for the platform.


    Likewise, for the assignment

    unsigned int u_i = c[0];

    since -0x3C (that is, -60) is not in the range of unsigned int, the actual value is the value (mod UINT_MAX + 1) that falls in the range of unsigned int; in other words, we add or subtract UINT_MAX + 1 (notice that the integer promotions can come into play here, so you might need casts in actual C code) until the value is in range. With a 32-bit unsigned int, UINT_MAX is 0xFFFFFFFF; add 1 to it to get 0x100000000. 0x100000000 - 0x3C is the 0xFFFFFFC4 that you saw.

    Had you run this on a platform where char is unsigned, the result would have been 0xC4!


    BTW in i = i >> 1;, i is a signed integer with a negative value; C11 says the resulting value is implementation-defined, so the actual behaviour can differ from compiler to compiler. The GCC manuals state that

    Signed >> acts on negative numbers by sign extension.

    However a strictly-conforming program should not rely on this.