Tags: c, scanf, stdio, null-character

Inconsistent behavior of fscanf() across different compilers (consuming trailing null character)


I wrote a complete application in C99 and tested it thoroughly on two GNU/Linux-based systems. I was surprised when an attempt to compile it using Visual Studio on Windows resulted in the application misbehaving. At first I couldn't determine what was wrong, but stepping through the program in the VC debugger revealed a discrepancy in the behavior of the fscanf() function declared in stdio.h.

The following code is sufficient to demonstrate the problem:

#include <stdio.h>

int main() {
    unsigned num1, num2, num3;

    FILE *file = fopen("file.bin", "rb");
    if (file == NULL) { // bail out rather than pass NULL to fscanf()
        perror("file.bin");
        return 1;
    }
    fscanf(file, "%u", &num1);
    fgetc(file); // consume and discard \0
    fscanf(file, "%u", &num2);
    fgetc(file); // ditto
    fscanf(file, "%u", &num3);
    fgetc(file); // ditto
    fclose(file);

    printf("%u, %u, %u\n", num1, num2, num3); // %u matches the unsigned arguments

    return 0;
}

Assume that file.bin contains exactly 512\0256\0128\0:

$ hexdump -C file.bin
00000000  35 31 32 00 32 35 36 00  31 32 38 00              |512.256.128.|

When compiled with GCC 4.8.4 on an Ubuntu machine, the resulting program reads the numbers as expected and prints 512, 256, 128 to stdout.
Compiling it with MinGW 4.8.1 on Windows gives the same, expected result.

However, there seems to be a major difference when I compile the code using Visual Studio Community 2015; namely, the output is:

512, 56, 28

As you can see, the trailing null characters have already been consumed by fscanf(), so fgetc() captures and discards characters that are essential to data integrity.

Commenting out the fgetc() lines makes the code work in VC, but breaks it in GCC (and possibly other compilers).

What is going on here, and how do I turn this into portable C code? Have I hit undefined behavior? Note that I'm assuming the C99 standard.


Solution

  • TL;DR: you've been bitten by MSVC non-conformance, a longstanding problem that MS has never shown much interest in solving. If you must support MSVC in addition to conforming C implementations, then one way to do so is to use conditional compilation directives to suppress the fgetc() calls when the program is compiled with MSVC.
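As an alternative to keying on the compiler, the terminator can be handled at run time. The sketch below is an assumption-laden illustration, not code from the question: read_field is a hypothetical helper that reads the byte after the number and pushes it back with ungetc() unless it is the expected \0. It assumes every field is a non-empty run of decimal digits, so a field never begins with a \0 byte.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the question): read one decimal field and
 * step past its '\0' terminator, whether or not fscanf() already ate it.
 * Returns 1 on success, 0 if no number could be converted. */
static int read_field(FILE *fp, unsigned *out)
{
    int c;

    assert(fp != NULL && out != NULL);
    if (fscanf(fp, "%u", out) != 1)
        return 0;               /* no number could be converted */

    c = fgetc(fp);              /* look at the byte after the digits */
    if (c != '\0' && c != EOF)
        ungetc(c, fp);          /* not a terminator: put it back */
    return 1;
}
```

Calling read_field(file, &num1), then the same for num2 and num3, would replace the three fscanf()/fgetc() pairs in the question. One character of pushback is all the C standard guarantees for ungetc(), and that is all this sketch relies on.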


    I'm inclined to agree with the comments that reading binary data via formatted I/O functions is a questionable plan. Even more questionable, however, is the combination of

    compil[ing] it using Visual Studio on Windows

    and

    assuming the C99 standard.

    As far as I am aware, no version of MSVC conforms to C99. Very recent versions may do a better job of conforming to C2011, in part because C2011 makes some features optional that were mandatory in C99.

    Whichever version of MSVC you're using, however, I think it fails to conform to the standard (both C99 and C2011) in this area. Here is the relevant text from C99, section 7.19.6.2:

    A conversion specification is executed in the following steps:

    [...]

    An input item is read from the stream [...]. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. The first character, if any, after the input item remains unread.

    The standard is quite clear that the first character that does not match the input sequence remains unread, so the only ways MSVC could be considered conforming are if the \0 characters could be construed as being part of (and terminating) a matching input sequence, or if fgetc() were permitted to skip \0 characters. I see no justification for the latter, especially given that the stream was opened in binary mode, so let's consider the former.
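The "remains unread" rule can be checked directly. The following is a minimal probe, assuming tmpfile() is usable on the target system; char_after_u is a hypothetical name introduced only for this illustration. It writes some bytes to a temporary binary stream, performs one %u conversion, and reports what the next fgetc() returns.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical probe: write n bytes to a temporary binary stream, convert
 * one %u item, and return the character the next fgetc() yields.
 * Returns EOF if the stream cannot be set up or the conversion fails. */
static int char_after_u(const char *bytes, size_t n, unsigned *value)
{
    FILE *fp;
    int c = EOF;

    assert(bytes != NULL && value != NULL);
    fp = tmpfile();
    if (fp == NULL)
        return EOF;

    if (fwrite(bytes, 1, n, fp) == n) {
        rewind(fp);                     /* switch from writing to reading */
        if (fscanf(fp, "%u", value) == 1)
            c = fgetc(fp);              /* first character after the input item */
    }
    fclose(fp);
    return c;
}
```

Under the quoted wording, char_after_u("512\0" "256", 7, &v) should return '\0' with v set to 512, because the \0 is the first character after the input item. The output reported in the question suggests the Visual Studio 2015 runtime would return '2' instead. The string is split into two literals so that "\0256" is not parsed as an octal escape.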

    For a u conversion specifier, a matching input sequence is defined as one that

    Matches an optionally signed decimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 10 for the base argument.

    The "subject sequence of the strtoul function" is defined in that function's specifications:

    First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string.

    Note in particular that the terminating null character is explicitly attributed to the final string of unrecognized characters. It is not part of the subject sequence, and therefore should not be matched by fscanf() when it converts input according to a u specifier.
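strtoul() itself exposes where the subject sequence ends through its second argument. The sketch below uses a hypothetical wrapper, first_unrecognized, to show that the terminating null character is where the "final string" begins rather than part of the number.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper: convert s in base 10 and return a pointer to the
 * first character of the "final string" that strtoul() did not recognize. */
static const char *first_unrecognized(const char *s, unsigned long *value)
{
    char *end;

    assert(s != NULL && value != NULL);
    *value = strtoul(s, &end, 10);
    return end;
}
```

For the input "512", the returned pointer is s + 3 and the character it points to is the terminating '\0', so the subject sequence is exactly the three digits.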