
Must `strtod()` skip locale-specific white-space to be compliant?


Is my C library compliant?

In testing strtod(), my code reported an interesting inconsistency:
strtod("\240" "123", ...) --> 0.0.

In locales that identified character 160 '\240' as a white-space, strtod() did not skip character 160 as a leading white-space, yet strtol() did.

I suspect my library has a corner-case bug, as I expected strtod() to follow the current locale's isspace().

Is this a bug, allowed behavior, or a bug that has since been fixed?


Sample code:

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

void test_locale_name(const char *locale_name) {
  const char *current_locale = setlocale(LC_ALL, locale_name);
  if (current_locale) {
    printf("Current locale name \"%s\"\n", current_locale);
  }
  puts("White spaces in this locale\n");
  for (int i = UCHAR_MAX; i > 0; i--) {
    if (isspace(i) && i != '\n' && i != '\r') {
      char buf[100];
      snprintf(buf, sizeof buf, "%c123", i);
      char *endptr;
      errno = 0;
      long val = strtol(buf, &endptr, 0);
      printf("Character code %3d:   strtol(\"%s\") converts to %3ld, length = %d\n", i,
          buf, val, (int) (endptr - buf));
      errno = 0;
      double value_d = strtod(buf, &endptr);
      printf("Character code %3d:   strtod(\"%s\") converts to %3g, length = %d\n", i,
          buf, value_d, (int) (endptr - buf));
    }
  }
  puts("\n");
}

int main(void) {
  /*
   *  A null pointer for locale causes the setlocale function to return a pointer
   *  to the string associated with the category for the program’s current locale;
   *  the program’s locale is not changed.
   */
  const char *locale_names[] = { NULL, "POSIX", "af_ZA" };
  size_t locale_names_n = sizeof locale_names / sizeof locale_names[0];
  for (size_t i = 0; i < locale_names_n; i++) {
    test_locale_name(locale_names[i]);
  }
  return 0;
}

Sample output:

Current locale name "C"
White spaces in this locale

Character code  32:   strtol(" 123") converts to 123, length = 4
Character code  32:   strtod(" 123") converts to 123, length = 4
Character code  12:   strtol("123") converts to 123, length = 4
Character code  12:   strtod("123") converts to 123, length = 4
Character code  11:   strtol("123") converts to 123, length = 4
Character code  11:   strtod("123") converts to 123, length = 4
Character code   9:   strtol("  123") converts to 123, length = 4
Character code   9:   strtod("  123") converts to 123, length = 4


Current locale name "C"
White spaces in this locale

Character code  32:   strtol(" 123") converts to 123, length = 4
Character code  32:   strtod(" 123") converts to 123, length = 4
Character code  12:   strtol("123") converts to 123, length = 4
Character code  12:   strtod("123") converts to 123, length = 4
Character code  11:   strtol("123") converts to 123, length = 4
Character code  11:   strtod("123") converts to 123, length = 4
Character code   9:   strtol("  123") converts to 123, length = 4
Character code   9:   strtod("  123") converts to 123, length = 4


Current locale name "af_ZA"
White spaces in this locale

Character code 160:   strtol("�123") converts to 123, length = 4
Character code 160:   strtod("�123") converts to   0, length = 0  ***!!!!***
Character code  32:   strtol(" 123") converts to 123, length = 4
Character code  32:   strtod(" 123") converts to 123, length = 4
Character code  12:   strtol("123") converts to 123, length = 4
Character code  12:   strtod("123") converts to 123, length = 4
Character code  11:   strtol("123") converts to 123, length = 4
Character code  11:   strtod("123") converts to 123, length = 4
Character code   9:   strtol("  123") converts to 123, length = 4
Character code   9:   strtod("  123") converts to 123, length = 4


C23 spec has:

In this clause, "white-space character" refers to (execution) white-space character as defined by isspace. C23dr § 7.1.1 5

7.24.1.5 The strtod, strtof, and strtold functions
First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters, a subject sequence resembling a floating constant ... § 7.24.1.5 2

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted. § 7.24.1.5 7

and similar specs for strtol()

7.24.1.7 The strtol, strtoll, strtoul, and strtoull functions
First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters, a subject sequence resembling an integer ...
C23dr § 7.24.1.7 6

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
C23dr § 7.24.1.7 3

Compiler output includes:

Invoking: Cygwin C Compiler
gcc -O0 -g3 -pedantic -Wall -Wextra -Wconversion -Wsign-conversion -c -std=c17  -fmessage-length=0 -Wformat -Wformat-security -Wformat=2 -Wmaybe-uninitialized -Werror=stringop-truncation -Wcast-align=strict -v -MMD -MP -MF"strtod7.d" -MT"strtod7.o" -o "strtod7.o" "../strtod7.c"
Using built-in specs.
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
Configured with: /mnt/share/cygpkgs/gcc/gcc.x86_64/src/gcc-12.4.0/configure --srcdir=/mnt/share/cygpkgs/gcc/gcc.x86_64/src/gcc-12.4.0 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --libexecdir=/usr/lib --with-gcc-major-version-only --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --enable-__cxa_atexit --enable-clocale=newlib --with-dwarf2 --with-tune=generic --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++,jit --enable-graphite --enable-threads=posix --enable-libatomic --enable-libgomp --enable-libquadmath --enable-libquadmath-support --disable-libssp --enable-libada --disable-symvers --disable-multilib --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --enable-linker-build-id --with-default-libstdcxx-abi=gcc4-compatible --enable-libstdcxx-filesystem-ts
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.4.0 (GCC) 
COLLECT_GCC_OPTIONS='-O0' '-g3' '-Wpedantic' '-Wall' '-Wextra' '-Wconversion' '-Wsign-conversion' '-c' '-std=c17' '-fmessage-length=0' '-Wformat=1' '-Wformat-security' '-Wformat=2' '-Wmaybe-uninitialized' '-Werror=stringop-truncation' '-Wcast-align=strict' '-v' '-MMD' '-MP' '-MF' 'strtod7.d' '-MT' 'strtod7.o' '-o' 'strtod7.o' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-pc-cygwin/12/cc1.exe -quiet -v -MMD strtod7.d -MF strtod7.d -MP -MT strtod7.o -dD -idirafter /usr/lib/gcc/x86_64-pc-cygwin/12/../../../../lib/../include/w32api -idirafter /usr/lib/gcc/x86_64-pc-cygwin/12/../../../../x86_64-pc-cygwin/lib/../lib/../../include/w32api ../strtod7.c -quiet -dumpbase strtod7.c -dumpbase-ext .c -mtune=generic -march=x86-64 -g3 -O0 -Wpedantic -Wall -Wextra -Wconversion -Wsign-conversion -Wformat=1 -Wformat-security -Wformat=2 -Wmaybe-uninitialized -Werror=stringop-truncation -Wcast-align=strict -std=c17 -version -fmessage-length=0 -o /cygdrive/c/Users/TPC/AppData/Local/Temp/cch4QUWQ.s
GNU C17 (GCC) version 12.4.0 (x86_64-pc-cygwin)
    compiled by GNU C version 12.4.0, GMP version 6.3.0, MPFR version 4.2.1, MPC version 1.3.1, isl version isl-0.27-GMP

This is not the latest gcc, but I could not find a related fix listed for the newer (14.2?), perhaps less stable, releases.


Solution

  • I would consider this a bug: the C Standard defines white-space for these functions in terms of isspace(), so the characters skipped must be locale-specific and consistent with what isspace() reports.

    The implementation of strtod in newlib, originally written by David M. Gay at AT&T, has this (somewhat cryptic) code to parse the initial portion of the argument string:

        for(s = s00;;s++) switch(*s) {
            case '-':
                sign = 1;
                /* no break */
            case '+':
                if (*++s)
                    goto break2;
                /* no break */
            case 0:
                goto ret0;
            case '\t':
            case '\n':
            case '\v':
            case '\f':
            case '\r':
            case ' ':
                continue;
            default:
                goto break2;
            }
     break2:
    

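    Only the ASCII white-space characters are tested explicitly; the locale information is used solely to recognize the decimal separator. Any other character, including 160, falls through to the default case and ends the white-space scan.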

    Conversely, strtol redirects to the much older _strtol_l implementation from the BSD code, which uses a simple loop to skip the optional leading white-space in a Standard-conforming, locale-dependent way:

        /*
         * Skip white space and pick up leading +/- sign if any.
         * If base is 0, allow 0x for hex and 0 for octal, else
         * assume decimal; if base is already 16, allow 0x.
         */
        do {
            c = *s++;
        } while (isspace_l(c, loc));
        if (c == '-') {
            neg = 1;
            c = *s++;
        } else if (c == '+')
            c = *s++;
        if ((base == 0 || base == 16) &&
            c == '0' && (*s == 'x' || *s == 'X')) {
            c = s[1];
            s += 2;
            base = 16;
        }
        if (base == 0)
            base = c == '0' ? 8 : 10;
    

    Note, however, that this implementation is incorrect too: the 0x prefix is skipped unconditionally, which is non-conforming when it is not followed by a hex digit.
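
    To make the 0x remark concrete, here is a minimal sketch of the extra check a conforming parser needs before consuming the prefix. The helper name hex_prefix_len is hypothetical and is not taken from newlib or the BSD sources; it only illustrates the condition.

    ```c
    #include <ctype.h>
    #include <stddef.h>

    /* Hypothetical helper (not library code): returns 2 when s starts with
       "0x" or "0X" immediately followed by a hex digit, otherwise 0.
       Short-circuit evaluation keeps every read inside the string. */
    size_t hex_prefix_len(const char *s) {
      if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X') &&
          isxdigit((unsigned char) s[2])) {
        return 2;
      }
      return 0;
    }
    ```

    With a check like this, an input such as "0xg" in base 16 keeps only the "0" as the subject sequence, so the result is 0 and the end pointer is left at the "x".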

    You should file at least one bug report with the newlib support team.
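
    Until the library is fixed, one possible workaround is to skip the leading white-space yourself with isspace() and hand the rest to strtod(). This is only a sketch: the wrapper name strtod_skip_ws is made up for illustration, and it assumes a single-byte locale encoding such as the one in the question.

    ```c
    #include <ctype.h>
    #include <stdlib.h>

    /* Hypothetical wrapper (not part of any library): skips leading
       white-space as classified by isspace() in the current locale,
       then lets strtod() parse the remainder. */
    double strtod_skip_ws(const char *s, char **endptr) {
      const char *p = s;
      while (isspace((unsigned char) *p)) {
        p++;
      }
      char *end;
      double value = strtod(p, &end);
      if (endptr) {
        /* When nothing was converted, report the original start of the
           string, as strtod() itself does. */
        *endptr = (end == p) ? (char *) s : end;
      }
      return value;
    }
    ```

    Calling strtod_skip_ws("\240" "123", &endptr) in a locale where isspace(160) is true should then behave like strtol() does in the sample output above, consuming all four characters.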