c kernighan-and-ritchie

Subtlety in conversion of characters to integers


Can someone explain clearly what these lines from K&R actually mean:

"When a char is converted to an int, can it ever produce a negative integer? The answer varies from machine to machine. The definition of C guarantees that any character in the machine's standard printing character set will never be negative, but arbitrary bit patterns stored in character variables may appear to be negative on some machines, yet positive on others."


Solution

  • You need to understand several things first.

    1. If I take an 8-bit value and extend it to a 16-bit value, normally you would imagine just adding a bunch of 0's on the left. For example, if I have the 8-bit value 23, in binary that's 00010111, so as a 16-bit number it's 0000000000010111, which is also 23.

    2. It turns out that negative numbers always have a 1 in the high-order bit. (There might be weird machines for which this is not true, but it's true for any machine you're likely to use.) For example, the 8-bit value -40 is represented in binary as 11011000.

    3. So when you convert a signed 8-bit value to a 16-bit value, if the high-order bit is 1 (that is, if the number is negative), you do not add a bunch of 0's on the left, you add a bunch of 1's instead. For example, going back to -40, we would convert 11011000 to 1111111111011000, which is the 16-bit representation of -40.

    4. There are also unsigned numbers, which are never negative. For example, the 8-bit unsigned number 216 is represented as 11011000. (You will notice that this is the same bit pattern as the signed number -40 had.) When you convert an unsigned 8-bit number to 16 bits, you add a bunch of 0's no matter what. For example, you would convert 11011000 to 0000000011011000, which is the 16-bit representation of 216.

    5. So, putting this all together, if you're converting an 8-bit number to 16 (or more) bits, you have to look at two things. First, is the number signed or unsigned? If it's unsigned, just add a bunch of 0's on the left. But if it's signed, you have to look at the high-order bit of the 8-bit number. If it's 0 (if the number is positive), add a bunch of 0's on the left. But if it's 1 (if the number is negative), add a bunch of 1's on the left. (This whole process is known as sign extension.)

    6. The ordinary ASCII characters (like 'A' and '1' and '$') all have values less than 128, which means that their high-order bit is always 0. But "special" characters from the Latin-1 character set, and the bytes of multi-byte UTF-8 sequences, have values of 128 or greater, so their high-order bit is 1. For this reason they're sometimes also called "high bit" or "eighth bit" characters. For example, the Latin-1 character Ø (O with a slash through it) has the value 216.

    7. Finally, although type char in C is typically an 8-bit type, the C Standard does not specify whether it is signed or unsigned. Each compiler makes that choice for itself.
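
    Steps 1 through 5 can be sketched in C using the fixed-width types from <stdint.h>. This is only an illustration: the function names widen_signed and widen_unsigned are made up for this example, and it assumes a platform that provides int8_t (which the standard requires to be two's complement).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: treat the 8-bit pattern as a SIGNED value and widen
 * it to 16 bits. memcpy reinterprets the bits without an out-of-range
 * conversion; the widening itself copies the high-order bit to the left. */
int16_t widen_signed(uint8_t bits)
{
    int8_t s;
    memcpy(&s, &bits, sizeof s);
    return (int16_t)s;          /* sign extension: 11011000 -> 1111111111011000 */
}

/* Hypothetical helper: treat the same 8-bit pattern as an UNSIGNED value
 * and widen it to 16 bits. The new bits on the left are always 0. */
uint16_t widen_unsigned(uint8_t bits)
{
    return (uint16_t)bits;      /* zero extension: 11011000 -> 0000000011011000 */
}
```

    With the bit pattern 11011000 (0xD8), widen_signed(0xD8) yields -40 and widen_unsigned(0xD8) yields 216, while a pattern with a 0 in the high-order bit, such as 00010111 (0x17), yields 23 either way.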

    Putting this all together, what Kernighan and Ritchie are saying is that when we convert a char to a 16- or 32-bit integer, we don't quite know how to apply step 5. If I'm on a machine where type char is unsigned, and I take the character Ø and convert it to an int, I'll probably get the value 216. But if I'm on a machine where type char is signed, I'll probably get the number -40.
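
    To see which kind of machine you are on, something like the following sketch may help. The function names are invented for this example, and it assumes 8-bit chars and two's complement representation, which holds on any machine you're likely to use. CHAR_MIN from <limits.h> is 0 when plain char is unsigned and negative when it is signed.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical helpers: store one byte in each flavour of char, then let
 * the ordinary char-to-int conversion happen on return. */
int as_signed_char(unsigned char byte)
{
    signed char c;
    memcpy(&c, &byte, 1);   /* copy the bit pattern unchanged */
    return c;               /* sign-extended: 0xD8 becomes -40 */
}

int as_unsigned_char(unsigned char byte)
{
    return byte;            /* zero-extended: 0xD8 becomes 216 */
}

int as_plain_char(unsigned char byte)
{
    char c;
    memcpy(&c, &byte, 1);
    return c;               /* -40 or 216, depending on the implementation */
}

/* Returns 1 if plain char is signed on this implementation, 0 otherwise. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

    Calling as_plain_char(0xD8) gives the same result as as_signed_char(0xD8) when plain_char_is_signed() returns 1, and the same result as as_unsigned_char(0xD8) otherwise. An ordinary character such as 'A' comes out non-negative in all three, which is the guarantee K&R mention. If you want a byte value in the range 0 to 255 regardless of the machine, declare the variable as unsigned char explicitly.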