
Using fscanf() to read a file with lines of 3 numbers each, why does "%d%d%d%*c" work as well as "%d%d%d"?


I know that the %d format specifier, when used here in fscanf(), reads an integer and ignores the white space preceding it, including the newline (I verified it). But in the following program, which uses fscanf() to read from a file of multiple lines with 3 integers each, the format string "%d%d%d%*c" works as well as "%d%d%d".

Why is that? Since fscanf() with %d as the first specifier in the format string ignores any white space preceding an integer, why doesn't the extra %*c used as the last specifier cause any error or side effect? Had the %d specifier not been ignoring the newline after each group of 3 numbers in a line, then %*c would have made sense, as it would eat the newline. But why does it work without error or side effect even though fscanf() ignores white space for %d by default? Shouldn't fscanf() stop scanning when %*c can't find a character to eat and there is a mismatch between the specifier and the input? Isn't fscanf() supposed to stop when there is a mismatch, just as scanf() does?

EDIT: It even works if I use "%*c%d%d%d"! Shouldn't the scanning and processing of subsequent characters stop once there is a mismatch between the format specifier and the input at the beginning?

#include <stdio.h>
#include <stdlib.h>


int main ()
{
    int n1,n2,n3;
    FILE *fp;
    fp=fopen("D:\\data.txt","r");

    if(fp==NULL)
    {
        printf("Error");
        exit(-1);
    }

    while(fscanf(fp,"%d%d%d%*c",&n1,&n2,&n3)!=EOF) //Works as good as line below
    //while(fscanf(fp,"%d%d%d",&n1,&n2,&n3)!=EOF)
        printf("%d,%d,%d\n",n1,n2,n3);
    fclose(fp);

}

Here's the format of the data in my file data.txt:

243 343 434
393 322 439
984 143 943
438 243 938

Output:

243 343 434
393 322 439
984 143 943
438 243 938

Solution

  • Consider this variation of the program in the question:

    #include <stdio.h>
    #include <stdlib.h>
    
    int main(int argc, char **argv)
    {
        char *file = "D:\\data.txt";
        FILE *fp;
        char *formats[] =
        {
        "%d%d%d%*c",
        "%d%d%d",
        "%*c%d%d%d",
        };
    
        if (argc > 1)
            file = argv[1];
    
        for (int i = 0; i < 3; i++)
        {
            if ((fp = fopen(file, "r")) == 0)
            {
                fprintf(stderr, "Failed to open file %s\n", file);
                break;
            }
            printf("Format: %s\n", formats[i]);
            int n1,n2,n3;
            while (fscanf(fp, formats[i], &n1, &n2, &n3) == 3)
                printf("%d, %d, %d\n", n1, n2, n3);
            fclose(fp);
        }
        return 0;
    }
    

    The repeated opens are not efficient, but that isn't a concern here. Clarity and showing the behaviour is much more important.

    It is written to (a) use a file name specified on the command line, so I don't have to futz with names such as D:\data.txt, which are very inconvenient to create on Unix systems, and (b) show the three formats in use.

    Given the data file from the question:

    243 343 434
    393 322 439
    984 143 943
    438 243 938
    

    The output of the program is:

    Format: %d%d%d%*c
    243, 343, 434
    393, 322, 439
    984, 143, 943
    438, 243, 938
    Format: %d%d%d
    243, 343, 434
    393, 322, 439
    984, 143, 943
    438, 243, 938
    Format: %*c%d%d%d
    43, 343, 434
    393, 322, 439
    984, 143, 943
    438, 243, 938
    

    Note that the first digit of the first number is consumed by the %*c when that is the first part of the format. On each later call, the %*c reads the newline left after the third number on the previous line, then the %d skips further white space (except there isn't any) and reads the next number.

    Otherwise, the behaviour is as expounded in the commentary below, largely lifted from another related question.


    Some of the code under discussion in the related question "Use fscanf() to read from given line" was:

    fscanf(f, "%*d %*d %*d%*c");
    fscanf(f, "%d%d%d", &num1, &num2, &num3);
    

    I noted that the code should test the return value from fscanf(). However, with the three %*d conversion specifications, you might get a return value of EOF if you encountered EOF before reaching the specified line. Unfortunately, you've no way of knowing that the first line contained a letter instead of a digit until you execute the second fscanf(). You should test the second fscanf() too; you might get EOF, or 0 or 1 or 2 (all of which indicate problems), or you might get 3, indicating success with 3 conversions. Note that adding \n to the format means blank lines will be skipped, but that was going to happen anyway; %d skips white space to the first digit.

    Is there any other way we can read but ignore entire lines, like I clumsily did with fscanf(f,"%*d%*d%*d")? Is using %*[^\n] the nearest thing one can do for this?

    The best way to skip whole lines is to use fgets(), as in the last version of the code in my answer. Obviously, there's an outside chance it will miscount lines if any of those lines is longer than 4095 bytes. OTOH, that's fairly improbable.

    I'm confused now, and I don't want to post this as a separate question. So can you tell me this: fscanf() ignores white space automatically, so after the first line, when three integers are read and ignored according to my %*d%*d%*d specifier, I expect fscanf() to ignore the newline too when it starts reading in the next run of the loop. But why doesn't my additional %*c or \n cause problems, and why does the program run fine when I use %*d%*d%*d%*c or %*d%*d%*d\n in my code?

    You can't tell where anything went wrong with those formats; you can detect EOF, but otherwise fscanf() will return 0, because suppressed conversions are not counted. However, since %*d skips leading white space, including newlines, it doesn't much matter whether you read the newline after the third number with %*c or not. When you have \n there, that's white space in the format, so the read skips the newline and any trailing or leading white space, stopping when it reaches a non-white-space character. Of course, you could also have newlines in the middle of the three numbers, or you could have more than three numbers on a line.

    Note that the trailing \n in the format is particularly weird when the user is typing at the terminal. The user hits return, and keeps on hitting return, but the program doesn't continue until the user types a non-blank character. This is why fscanf() is so difficult to use when the data is not reliable. When it's reliable, it's easy, but if anything goes wrong, diagnostics and recovery are painful. That's why it is better to use fgets() and sscanf(); you have control over what is being parsed, you can try again with a different format if you want to, and you can report the whole line, not just what fscanf() has not managed to interpret.

    Note that %c (and %*c) does not skip over white space; therefore, a %*c at the end of the format reads (and discards) the character after the number that was read. If that is the newline, then that's the character read and ignored. The scan set %[...] is the other conversion specification that does not skip white space; all other standard conversion specifications skip leading white space.