Tags: c, macos, utf-8, locale

How to handle values greater than 255 as returned by toupper() in recent macOS with UTF-8 locales


The code below tries to address the following problem: how to reliably detect that a UTF-8 based locale might be in use, so that code points above 127 are not queried for ctype attributes, since we are dealing with plain (not wide) chars.

At least in macOS 14, when using a UTF-8 based locale, the following program shows 2 problematic code points that, even though they are valid unsigned char values, get responses from toupper() that do not fit in that type:

#include <langinfo.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#ifdef WORKAROUND
static int is_utf8_locale(void)
{
        const char *charmap = nl_langinfo(CODESET);

        /* this shouldn't happen */
        if (!charmap)
                return 0;

        if (!strncmp(charmap, "UTF-8", 5))
                return 1;

        /*
         * nl_langinfo should never return an empty string, unless the "item" used is invalid, and it
         * should return the C/POSIX CODESET if the locale is missing one, but ...
         */
        if (!*charmap) {
                unsigned char buf[MB_CUR_MAX + 1];
                return (wctomb((char *)buf, 0xf8ff) == 3) &&
                        (buf[0] == 0xef && buf[1] == 0xa3 && buf[2] == 0xbf);
        }

        return 0;
}
#endif

int main(int argc, char *argv[])
{
        const char *locale = (argc > 1) ? argv[1] : "fr_FR";
        int i, f = 0;

        if (!setlocale(LC_CTYPE, locale))
                return 127;

#ifdef WORKAROUND
        if (is_utf8_locale())
                return 0;
#endif

        for (i = 128; i < 256; i++) {
                int u, l;
                unsigned char c = i;

                l = tolower(c);
                u = toupper(c);

                if (l > 255) {
                        int t = l % 256;
                        f++;
                        printf("tolower(%d) %c -> %d (%#x) %c\n", c, c, l, t, t);
                }
                if (u > 255) {
                        int t = u % 256;
                        f++;
                        printf("toupper(%d) %c -> %d (%#x) %c\n", c, c, u, t, t);
                }
        }
        return f;
}

The output (after some minor formatting) for the default fr_FR locale shows:

toupper(181) µ -> 924 (0x9c) <9c>
toupper(255) ÿ -> 376 (0x78) x

AFAIK, this change of behaviour for toupper() is somewhat recent; at least it doesn't happen with 10.15. While toupper() has been known to "sometimes" try to be more helpful (as a BSD extension), I couldn't reproduce the problem on any recent BSD system I tried, and they all document that behaviour as deprecated, suggesting the wide char interfaces instead.

The "-DWORKAROUND" build works, but is IMHO too ugly and would also be problematic in a threaded environment, while being especially hacky because of the way macOS defines its locales.

All locales without an explicit .${CHARMAP} (which show the problematic response from nl_langinfo() described in the comments) seem to be affected, as do the ones that include .UTF-8 (sometimes spelled .utf8 on other systems), although the latter always return the correct value from nl_langinfo().

The workaround obviously needs additional code for non-POSIX systems.

The affected application doesn't support exotic multibyte encodings other than UTF-8, but the detection using wctomb() is fragile and might not work outside of Apple systems, so suggestions or testing results are also appreciated.
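
One possibly simpler heuristic, sketched here as an assumption rather than a tested fix: since wctomb() never stores more than MB_CUR_MAX bytes, any locale in which the wctomb() probe above produces a 3-byte sequence must also report MB_CUR_MAX > 1. The helpers below (locale_is_multibyte and byte_is_ctype_safe are names made up for this sketch) treat every multibyte locale as one where bytes above 127 are skipped, which is only acceptable because the application supports no multibyte encoding other than UTF-8.

```c
#include <stdlib.h>

/*
 * Assumption: a locale whose MB_CUR_MAX is greater than 1 uses a multibyte
 * encoding, so a single byte above 127 is not a complete character there.
 */
static int locale_is_multibyte(void)
{
        return MB_CUR_MAX > 1;
}

/* Nonzero when it seems reasonable to pass byte c to the ctype functions. */
static int byte_is_ctype_safe(unsigned char c)
{
        return c < 128 || !locale_is_multibyte();
}
```

Like the original workaround, this depends on the current locale, so the same caveat about threaded programs applies.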


Solution

  • The C Standard does not explicitly specify that values returned by toupper() must be in the range of type unsigned char or equal to the special value EOF, but the paragraphs quoted below seem to imply that they should:

    7.4 Character handling <ctype.h>

    The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

    [...]

    7.4.2.2 The toupper function

    Synopsis

    #include <ctype.h>
    int toupper(int c);
    

    Description
    The toupper function converts a lowercase letter to a corresponding uppercase letter.

    Returns
    If the argument is a character for which islower is true and there are one or more corresponding characters, as specified by the current locale, for which isupper is true, the toupper function returns one of the corresponding characters (always the same one for any given locale); otherwise, the argument is returned unchanged.

    This implies that toupper(181) either

    • returns a character for which isupper returns non zero
    • or returns 181

    If toupper returns a value greater than UCHAR_MAX for an argument in the defined range, this is a bug that is indeed likely to cause further problems in many programs, such as this one: https://github.com/PCRE2Project/pcre2/pull/313
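
    As a rough illustration of that reading of the standard, the two allowed outcomes can be written as a predicate. The name toupper_result_conforms is hypothetical, and this is only a sketch of the interpretation given above, not wording from the standard:

    ```c
    #include <ctype.h>
    #include <limits.h>

    /* Nonzero when u is an acceptable result of toupper(c) under that reading. */
    static int toupper_result_conforms(int c, int u)
    {
        if (u == c)
            return 1;   /* the argument was returned unchanged */
        /* otherwise u must be a valid ctype argument classified as uppercase */
        return u >= 0 && u <= UCHAR_MAX && isupper(u) != 0;
    }
    ```

    With the values reported for macOS, toupper_result_conforms(181, 924) is 0, because 924 is outside the unsigned char range and isupper is never consulted.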

    The problem is present on my system (macOS 13.4, Homebrew clang version 16.0.6, Target: x86_64-apple-darwin22.5.0) as shown with the test program below:

    #include <ctype.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    int main(int argc, char *argv[]) {
        const char *locale = (argc > 1) ? argv[1] : "fr_FR";
        int f = 0;
    
        if (!setlocale(LC_CTYPE, locale))
            return 127;
    
        printf("Testing locale %s:\n", locale);
    
        for (int c = 128; c < 256; c++) {
            int l = tolower(c);
            int u = toupper(c);
    
            if (l > 255) {
                /* one extra byte for the terminator; wctomb() returns -1 on failure */
                char cbuf[MB_CUR_MAX + 1];
                char lbuf[MB_CUR_MAX + 1];
                int cn = wctomb(cbuf, c);
                int ln = wctomb(lbuf, l);
                cbuf[cn > 0 ? cn : 0] = '\0';
                lbuf[ln > 0 ? ln : 0] = '\0';
                printf("%d: %s  isupper(%d): %d  tolower(%d): %d, %s\n",
                       c, cbuf, c, isupper(c), c, l, lbuf);
                f++;
            }
            if (u > 255) {
                char cbuf[MB_CUR_MAX + 1];
                char ubuf[MB_CUR_MAX + 1];
                int cn = wctomb(cbuf, c);
                int un = wctomb(ubuf, u);
                cbuf[cn > 0 ? cn : 0] = '\0';
                ubuf[un > 0 ? un : 0] = '\0';
                printf("%d: %s  islower(%d): %d  toupper(%d): %d, %s\n",
                       c, cbuf, c, islower(c), c, u, ubuf);
                f++;
            }
        }
        if (f) {
            printf("%d errors!\n", f);
        }
        return f;
    }
    

    Output in my system:

    Testing locale fr_FR:
    181: µ  islower(181): 1  toupper(181): 924, Μ
    255: ÿ  islower(255): 1  toupper(255): 376, Ÿ
    2 errors!
    

    Looking at the source code for the Apple LibC, it seems they are trying to use the same tables for the <ctype.h> macros and wide character versions towupper and towlower, only rejecting argument values larger than UCHAR_MAX. It so happens that µ and ÿ have an uppercase version in Unicode that is greater than UCHAR_MAX and should be returned by towupper for this locale, but should be ignored by toupper. This is a bug in the implementation.
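
    Until the library is fixed, a program can defend itself by discarding out-of-range results. A minimal sketch, with hypothetical names safe_toupper and safe_tolower, that falls back to the "argument is returned unchanged" behaviour whenever the result does not fit in an unsigned char:

    ```c
    #include <ctype.h>
    #include <limits.h>

    /* c must be a valid ctype argument: an unsigned char value or EOF. */
    static int safe_toupper(int c)
    {
        int u = toupper(c);
        /* keep the library's answer only if it fits in an unsigned char */
        return (u >= 0 && u <= UCHAR_MAX) ? u : c;
    }

    static int safe_tolower(int c)
    {
        int l = tolower(c);
        return (l >= 0 && l <= UCHAR_MAX) ? l : c;
    }
    ```

    This does not detect the locale at all; it only keeps results inside the range that byte-oriented callers expect, so µ and ÿ would simply map to themselves.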