Search code examples
cdata-conversion

What is the bug in this function that converts a hexatridecimal number (passed as a char) to a unsigned? How does one fix it?


I've been reading to the book Modern C by Jens Gustedt and on section 8.4, that present C library functions that deals with string processing and conversion, there is a piece of code in which is defined a function that converts a single char (using its int value) that represent a number on a hexatridecimal (36) base to its decimal equivalent. This function is supposed to explain how part of <stdlib.h>'s strtoul function works.

Here is the code presented:

/* Supposes that lowercase characters are contiguous. */
static_assert('z'-'a' == 25,
                "alphabetic characters not contiguous");
#include <ctype.h>
/* Converts an alphanumeric digit to an unsigned */
/* '0' ... '9'  =>  0 ..  9u */
/* 'A' ... 'Z'  => 10 .. 35u */
/* 'a' ... 'z'  => 10 .. 35u */
/* Other values =>   Greater */
unsigned hexatridecimal(int a) {
        if (isdigit(a)) {
                /* This is guaranteed to work: decimal digits
                   are consecutive, and isdigit is not
                   locale dependent. */
                return a - '0';
        } else {
                /* Leaves a unchanged if it not lowercase */
                a = toupper(a);
                /* Returns value >= 36 if not Latin uppercase */
                return (isupper(a)) ? 10 + (a - 'A') : -1;
        }
}

The thing is, to convert an alphabet letter to a decimal value it uses (a - 'A'), where a represent the letter being converted. About this operation, the author proceeds with this:

  1. The second return hexatridecimal makes an assumption about the relation between a and 'A'. What is it?

  2. Describe an error scenario in which this assumption is not fulfilled.

  3. Fix this bug: that is, rewrite the code such that it makes no assumption about the relation between a and 'A'.

Now, I don't know if I let some crucial piece of information fly past me or if I'm missing something else, but I can't figure something out for point number 2 and 3.

For the first point I suppose the assumption is a is greater or equal to 'A'. Since that is what is required to produce correct values according to the comment above the function. The thing is, by using the static_assert at the beginning of the code and the ternary operator at the return line, with isupper(a) as the condition, I can't see it happening and therefore don't know what to argue about the second point. Therefore, I'm also stuck at the third point.

Is the assumption something different from what I supposed? What is the bug in this, when would it lead to an error and how does one fix it?


Solution

  • The second return hexatridecimal makes an assumption about the relation between a and 'A'. What is it?

    It assumes that all characters 'A' to 'Z' as well as 'a' to 'z' are stored adjacently in alphabetical order in the symbol table. But in fact the C language makes no guarantees of any of that, which is probably what Gustedt wants to point out.

    There is for example the horror story of an ancient symbol table called EBCDIC where this isn't true. That one is supposedly still skulking about in very old IBM systems, or at least "language lawyers" love to scream "But what about EBCDIC!" whenever the topic of character to digit conversion pops up.

    The only characters that are guaranteed by C to be stored adjacently are '0' to '9'.


    Describe an error scenario in which this assumption is not fulfilled.

    EBCDIC!


    Fix this bug: that is, rewrite the code such that it makes no assumption about the relation between a and 'A'.

    You could slam in a bunch of static asserts etc for sure but then what if they fail? You wouldn't be able to use the function then. And alternatively, writing a bunch of run-time checks will slow down the code a lot.

    But besides that, the implementation in the book is naive in itself and could be improved in many ways. It turns out that we can fix performance and conversion problems in one go. A common way to implement fast symbol digit to integer conversion is to use a look-up table:

    unsigned hexatridecimal(int a)
    {
      static const int LOOKUP [127] = 
      {
        ['0'] = 0,
        ['1'] = 1,
        /* ... */
    
        ['A'] = 10,
        ['B'] = 11,
        /* ... */
    
        ['a'] = 10,
        ['b'] = 11,
        /* ... */
      };
    
      return LOOKUP[a];
    }
    

    This sacrifices 127*sizeof(int) bytes of read-only memory at the benefit of very fast conversion. The character value is used as an index to the look-up table array where the value to convert to is stored. No run-time calculations, branches or slow calls to ctype.h functions are required.

    The magic number 127 comes from the classic ASCII table (or UTF8) holding 127 printable characters, although the 32 lowest ones are non-printable. The beauty of over-allocating room for all manner of invalid characters is that when something is omitted from the LOOKUP initializer list because there was no initializer, that index will get set to zero.

    Error handling to prevent out of range numbers should probably be added, so that we know that the variable is in range 0 <= a < 127.

    Also notably there's no reason why we couldn't use the above lookup table to support all manner of bases from base 2 up to base 36 and use the same function for all of them, including the most common ones: binary, octal, decimal, hex.