
When returning the difference between pointers of char strings, how important is the order of casting and dereferencing?


For educational purposes (yes 42 yes) I'm rewriting strncmp, and a classmate just came up to me asking why I was casting my return values the way I do. My approach was to typecast first and dereference afterwards. My logic was that I wanted to treat the char string as an unsigned char string and dereference it as such.

int strncmp(const char *s1, const char *s2, size_t n)
{
    if (n == 0)
        return (0);
    while (*s1 == *s2 && *s1 && n > 1)
    {
        n--;
        s1++;
        s2++;
    }
    return (*(unsigned char *)s1 - *(unsigned char *)s2);
}

His was to dereference first and to typecast afterwards in order to make absolutely sure it returns the difference between two unsigned chars. Like this:

return ((unsigned char)*s1 - (unsigned char)*s2);

Following the discussion (and me agreeing with him that my casting is weird) we looked up the source code of some production-ready implementations, and to our surprise Apple seems to cast/dereference in the same order as I do:

https://opensource.apple.com/source/Libc/Libc-167/gen.subproj/i386.subproj/strncmp.c.auto.html

Therefore the question: what is the difference in this case? And why choose one over the other?

(I've already found the following question, but it deals with casting/dereferencing data types of different sizes, whereas in the case of chars/unsigned chars it shouldn't matter, right?

In C, if I cast & dereference a pointer, does it matter which one I do first?)


Solution

  • On a two's complement system (which is pretty much all of them), it won't make a difference.

    The first example, *(unsigned char *)x, simply interprets the byte stored at the location as an unsigned char. So if the value stored there is -1, the stored byte (assuming CHAR_BIT=8) is 0xFF, and it is read as 255, since that is what the bit pattern means for an unsigned type.

    The second example, (unsigned char)*x (assuming char is signed on this compiler), first reads the value stored at the location and then converts it to unsigned. So we get -1, and to convert it to unsigned char the standard says that you add one more than the maximum value storable in that type to the negative value, as many times as necessary, until the result is within its range. So you get -1 + 256 = 255.
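    The two readings described above can be put side by side in a small sketch. The helper names read_cast_then_deref and read_deref_then_cast are made up for illustration, and the claim that both return the same number assumes a two's complement machine:

```c
#include <limits.h>

/* Hypothetical helper: cast the pointer first, then dereference.
   The stored byte is reinterpreted as an unsigned char. */
static int read_cast_then_deref(const char *p)
{
    return *(const unsigned char *)p;
}

/* Hypothetical helper: dereference first, then cast.
   The char value is read and then converted to unsigned char. */
static int read_deref_then_cast(const char *p)
{
    return (unsigned char)*p;
}
```

    For a char holding -1, both helpers should return UCHAR_MAX (255 when CHAR_BIT is 8) on a two's complement system, and for ordinary ASCII letters both return the plain character code.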

    However, if you somehow were on a one's complement system, things go a bit differently.

    Again, using *(unsigned char *)x, we reinterpret the hex representation of -1 as an unsigned char, but this time the hex representation is 0xFE, which will be interpreted as 254 rather than 255.

    Going back to (unsigned char)*x, it will still just perform -1 + 256 to get the end result of 255.
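    One's complement hardware is hard to come by, so the difference can only be simulated. A rough sketch, where ones_complement_pattern and converted_value are hypothetical helpers (not part of the question or of any library) and CHAR_BIT is assumed to be 8:

```c
/* Hypothetical helper: the 8-bit one's complement bit pattern of v,
   for v in the range [-127, 127]. A negative value is stored as the
   bitwise inverse of its magnitude. */
static unsigned ones_complement_pattern(int v)
{
    if (v >= 0)
        return (unsigned)v;
    return ~(unsigned)(-v) & 0xFFu;
}

/* Hypothetical helper: the value conversion to unsigned char, which
   works on the value (modulo 256 here) and not on the representation. */
static unsigned converted_value(int v)
{
    return (unsigned char)v;
}
```

    Here ones_complement_pattern(-1) gives 0xFE (254), which is what reinterpreting the byte would see, while converted_value(-1) gives 255, matching the two cases described above.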

    All that said, I'm not sure whether the C standard allows a character encoding to use the 8th bit of a char. ASCII does not use it, and ASCII-encoded strings are what you will most likely be working with, so you are unlikely to come across negative values when comparing actual strings.
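    For bytes that do have the 8th bit set (encodings such as Latin-1 and UTF-8 produce them), reading the bytes as unsigned char is what keeps the sign of the result consistent. A small sketch, assuming an ASCII-compatible execution character set; my_strncmp is a hypothetical name for the same logic as the function in the question, and naive_char_diff is a deliberately naive variant added only for contrast:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical name; same logic as the function in the question. */
static int my_strncmp(const char *s1, const char *s2, size_t n)
{
    if (n == 0)
        return 0;
    while (*s1 == *s2 && *s1 && n > 1)
    {
        n--;
        s1++;
        s2++;
    }
    /* Compare the differing bytes as unsigned char values. */
    return *(const unsigned char *)s1 - *(const unsigned char *)s2;
}

/* Naive variant: subtracts plain char values, so the sign of the
   result depends on whether char is signed on this platform. */
static int naive_char_diff(const char *s1, const char *s2)
{
    return *s1 - *s2;
}
```

    With these, my_strncmp("\xe9", "z", 1) is positive because 0xE9 is larger than the code for 'z' when both are read as unsigned, whereas naive_char_diff("\xe9", "z") is negative on platforms where char is signed.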


    Converting from signed to unsigned can be found in the C11 standard at section 6.3.1.3:

    1. When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.

    2. Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
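    Rule 2 can be followed literally in code and compared against what the cast does. This is only a sketch: to_uchar_by_rule is a hypothetical helper that performs the repeated addition or subtraction by hand.

```c
#include <limits.h>

/* Hypothetical helper: apply rule 2 of 6.3.1.3 step by step, adding or
   subtracting UCHAR_MAX + 1 until the value fits in unsigned char. */
static int to_uchar_by_rule(long v)
{
    const long modulus = (long)UCHAR_MAX + 1;

    while (v < 0)
        v += modulus;
    while (v > UCHAR_MAX)
        v -= modulus;
    return (int)v;
}
```

    For any input, the result should equal the plain cast (unsigned char)v; for example, -1 ends up as UCHAR_MAX either way.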