c, unsigned-char

Count characters in UTF8 when plain char is unsigned


To count characters (not bytes) in UTF-8 text, I use this function:

int schars(const char *s)
{
    int i = 0;

    while (*s) {
        if ((*s & 0xc0) != 0x80) i++;
        s++;
    }
    return i;
}

Does this work on implementations where plain char is unsigned char?


Solution

  • It should.

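    You are only using bitwise operators, and in this expression they give the same result whether plain char is signed or unsigned: *s is promoted to int, and masking with 0xc0 keeps only the top two bits of the byte. The only exception may be the != operator, but you could replace it with a ^ and rely on the result being nonzero, a la:

    ((*s & 0xc0) ^ 0x80)

    which is nonzero exactly when the masked value differs from 0x80, and then you have solely bitwise operators. (A form like !((*s & 0xc0) & 0x80) is not equivalent: it is false for lead bytes such as 0xc3, so it would count only ASCII bytes.)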

    You can verify that the characters are promoted to integers by checking section 3.3.10 of the ANSI C Standard, which states that "Each of the operands [of the bitwise AND] shall have integral type" and that the usual arithmetic conversions are performed on the operands.

    EDIT

    I amend my answer. Bitwise operations are not guaranteed to behave the same on signed types as on unsigned types, as per section 3.3 of the ANSI C Standard:

    Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.

    In fact, performing bitwise operations on signed integers is listed as a possible security hole here.

    In the Visual Studio compiler signed and unsigned are treated the same (see here).

    As this SO question discusses, it is better to use unsigned char when reading or manipulating memory byte by byte.
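
As a rough sketch of that last suggestion, the loop can read the string through an unsigned char pointer, so every byte is in the range 0..255 before it is masked and the signedness of plain char no longer matters. The name schars_u is made up for illustration, and the function assumes well-formed UTF-8; like the original, it counts code points (every byte that is not a 10xxxxxx continuation byte), not user-perceived characters.

```c
#include <stddef.h>

/* Hypothetical variant of schars() from the question.
 * Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
 * Assumes the input is a NUL-terminated, well-formed UTF-8 string. */
size_t schars_u(const char *s)
{
    /* View the bytes as unsigned so the masking below never
     * operates on a negative value. */
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        /* Lead bytes and ASCII bytes do not match the 10xxxxxx pattern. */
        if ((*p & 0xC0u) != 0x80u)
            n++;
        p++;
    }
    return n;
}
```

For example, schars_u("h\xC3\xA9llo") should return 5 for a 6-byte string, since 0xC3 0xA9 is the two-byte encoding of a single code point.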