c++ type-conversion byte short endianness

Error converting char[2] to unsigned short?

Edit:

After reading the comments (thanks to @M.M and @AnttiHaapala), I fixed my code but still get incorrect output.

New Code:

#include <iostream>
int main() {
    char * myChar;
    myChar = new char[2];
    myChar[1] = 0x00;
    myChar[0] = 0xE0;
    unsigned short myShort;
    myShort = ((myChar[1] << 8) | (myChar[0]));
    std::cout << myShort << std::endl;
    return 0;
}

Output: 65504

or, if you reverse the order: 57344

Old Post:

I have a two-byte value that I am reading from a file, and I would like to convert it to an unsigned short so I can use the numerical value.

Example code:

#include <iostream>
int main() {
    char myChar[2];
    myChar[1] = 'à';
    myChar[0] = '\0';
    unsigned short myShort;
    myShort = ((myChar[1] << 8) | (myChar[0]));
    std::cout << myShort << std::endl;
    return 0;
}

Output:

But à\0, i.e. the bytes E0 00, should have the value 224 when read as an unsigned two-byte value, shouldn't it?

Also very interesting...

This code:

#include <iostream>
int main() {
    char * myChar;
    myChar = "\0à";
    unsigned short myShort;
    myShort = ((myChar[1] << 8) | (myChar[0]));
    std::cout << myShort << std::endl;
    return 0;
}

Outputs:

Solution

NOTE: The original code has a complicating factor in that the source is UTF-8 encoded. Please check edit history of this answer to see my comments on that. However I think that is not the main issue you are asking about, so I have changed my answer to just address the edit. To avoid UTF-8 conversion issues, use '\xE0' instead of 'à'.

Regarding the edited code:

char * myChar;
myChar = new char[2];
myChar[1] = 0x00;
myChar[0] = 0xE0;
unsigned short myShort;
myShort = ((myChar[1] << 8) | (myChar[0]));
std::cout << myShort << std::endl;

On your system, char is signed with a range of -128 to 127, which is common. The assignment myChar[0] = 0xE0; is therefore the same as writing myChar[0] = 224;, because 0xE0 is an int literal with value 224.

This is an out-of-range conversion, which is implementation-defined behaviour (before C++20). Most commonly, implementations define it to adjust the value modulo 256 until it is in range. So you end up with the same result as:

myChar[0] = -32;

Then the calculation (myChar[1] << 8) | myChar[0] is 0 | (-32), which is -32. Finally, you convert to unsigned short. This is another out-of-range conversion, because the range of unsigned short is [0, 65535] on your system.

However, out-of-range conversion to unsigned type is well-defined to adjust modulo 65536 in this case, so the result is 65536 - 32 = 65504.

Reversing the order performs ((-32) << 8) | 0. Left-shifting a negative value is undefined behaviour before C++20, although on your system it has manifested itself as computing -32 * 256, giving -8192. Converting that to unsigned short gives 65536 - 8192 = 57344.

If you are trying to get 224 from the first example, the simplest way to do this is to use unsigned char instead of char. Then myChar[0] will hold the value 224 instead of the value -32.