
How are wctype.h functions supposed to be used correctly?


The various is... functions (e.g. isalpha, isdigit) in ctype.h have a well-known pitfall. They take int arguments but expect values in the unsigned char range (or EOF), so on a platform where char is signed, passing a char directly can sign-extend a negative value into an int outside that range, which is undefined behavior. I believe that the typical approach to handling this is to cast explicitly to unsigned char first.
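As a minimal sketch of the cast described above (count_alpha is a hypothetical helper written for illustration, not a standard library function):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the alphabetic characters in a
 * NUL-terminated string, using the current locale's isalpha. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Cast to unsigned char first, so that a negative char value
         * is never passed to isalpha as a negative int. */
        if (isalpha((unsigned char)*s)) {
            ++n;
        }
    }
    return n;
}
```

In the default "C" locale, count_alpha("ab1c") should return 3. A string containing bytes with the high bit set is also handled without passing an out-of-range value to isalpha.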

Okay, but what is the proper, portable way to deal with the various isw... functions in wctype.h? wchar_t, like char, may also be signed or unsigned, but because wchar_t is itself a typedef, the type name unsigned wchar_t is illegal.


Solution

  • Upon re-reading the ISO C99 specification's section on wctype.h, I found that it states:

    For all functions described in this subclause that accept an argument of type wint_t, the value shall be representable as a wchar_t or shall equal the value of the macro WEOF. If this argument has any other value, the behavior is undefined. (§7.25.1/5)

    Contrast this with the corresponding note for ctype.h:

    In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined. (§7.4/1)

    (The key difference is "representable as a wchar_t" versus "representable as an unsigned char".)

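    I think it's also worth understanding why the ctype.h functions require unsigned char representations. The standard requires that EOF be a negative int (§7.19.1/3), so the ctype.h functions take characters as unsigned char values, which are never negative, to (try to) avoid ambiguity between EOF and a real character. Where char is signed, for example, a char holding the byte 0xFF typically has the value -1, which is also a common value for EOF.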

    In contrast, that motivation doesn't exist for the wctype.h functions. The standard makes no such requirement of WEOF, as footnote 270 elaborates:

    The value of the macro WEOF may differ from that of EOF and need not be negative.

    because WEOF is already guaranteed to not conflict with any character represented by wchar_t (§7.24.1/3).

    Therefore the wctype.h functions don't have or need any of the unsigned nonsense, and wchar_t values can be passed to them directly.
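
Following that reading of the standard, a minimal sketch of the wide-character case might look like this (count_walpha is a hypothetical helper written for illustration, not a standard library function):

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: counts the alphabetic wide characters in a
 * NUL-terminated wide string, using the current locale's iswalpha. */
size_t count_walpha(const wchar_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != L'\0'; ++s) {
        /* No cast: the wchar_t value is passed as-is and converted
         * implicitly to the wint_t parameter type. */
        if (iswalpha(*s)) {
            ++n;
        }
    }
    return n;
}
```

In the default "C" locale, count_walpha(L"ab1c") should return 3. The only difference from the narrow-character version is the missing cast.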