Tags: c, char, int, printf, negative-number

Can %c be given a negative int argument in printf?


Can I pass a negative int to printf with the %c format specifier, given that the int is converted to unsigned char when it is printed? Is printf("%c", -65); valid? I tried it with GCC, but I get a diamond-shaped character with a question mark inside as output. Why?


Solution

  • Absolutely yes, if char is a signed type. C allows char to be either signed or unsigned, and in GCC you can switch between the two with -funsigned-char and -fsigned-char. When char is signed, the call is exactly equivalent to this:

    char c = -65;
    printf("%c", c);
    

    When passed to printf(), the char variable is promoted to int (sign-extended when char is signed), so printf() also sees -65, just as if the value had been passed as a constant. Because of the default argument promotions in variadic functions, printf() has no way to distinguish printf("%c", c); from printf("%c", -65);.

    The printed result depends on the character encoding, though. For example, in the ISO-8859-1 or Windows-1252 charsets you'll see ¿ because (unsigned char)-65 == 0xBF. In UTF-8 (which is a variable-length encoding), 0xBF is a continuation byte and is not allowed in the starting position of a character. That's why you see �, the replacement character shown for invalid bytes.

    Please tell me why code points 0 to 255 are not mapped to 0 to 255 in unsigned char. They are non-negative, so shouldn't I just look through the UTF-8 character set for their corresponding values?

    The mapping is not done by relative position in the range as you thought; that is, it is not the case that code point 0 maps to CHAR_MIN, code point 40 to CHAR_MIN + 40, and code point 255 to CHAR_MAX. On two's complement systems it is typically a simple mapping based on the value of the bit pattern when treated as unsigned, because that is how values are usually truncated from a wider type. In C, a character literal like 'a' has type int. Suppose 'a' is mapped to code point 130 in some theoretical character set; then the lines below are equivalent:

    char c = 'a';
    char c = 130;
    

    Either way, c is assigned the value of 'a' converted to char, i.e. (char)'a', which may be negative.

    So code points 0 to 255 are mapped to 0 to 255 in unsigned char. That means code point 0x1F is stored in a char (signed or unsigned) with value 0x1F. Code point 0xBF is stored as 0xBF if char is unsigned and as -65 if char is signed.

    I'm assuming 8-bit char for all of the above. Also note that UTF-8 is an encoding of the Unicode character set, not a charset in its own right, so there is no such thing as looking up UTF-8 code points.