c ascii fgetc non-printing-characters

Why does fgetc() in C always read extra, non-existent characters whenever I try to read non-printable characters from txt files?


I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.

However, I have noticed that for every non-printable character I read, there is always an extra non-printable character in front of the one I actually want to read.

For example, the character I want to read is "§". And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".

I looked it up in the debugger and saw "Â§" in the char array. But I don't have "Â" anywhere in my input file.

And after I write the non-printable character into my output file, I have noticed that it is also just "§", not "Â§".

There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?

Thanks!

Code as follows:

        case 1:
            mode = 1;
            FILE *fp;
            fp = fopen ("input2.txt", "r");
            int charCount = 0;

            while(!feof(fp)) {
                original_message[charCount] = fgetc(fp);
                charCount++;
            }
            original_message[charCount - 1] = '\0';
            fclose(fp);

            k = strlen(original_message);//split the original message into k input symbols
            printf("k: \n%lld\n", k);

            printf("ASCII code:\n");
            for (int i = 0; i < k; i++)
            {
                ASCII = original_message[i];
                printf("%d ", ASCII);
            }

Solution

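  • C's getchar, getc, and fgetc functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode. Your input file is evidently UTF-8, where § (code point 167) is stored as the two bytes 194 167 (0xC2 0xA7), so fgetc is reporting exactly what is in the file; nothing extra is being added.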
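
    To see where the question's "194 167" comes from, here is a rough sketch of the standard UTF-8 rule for code points from 128 to 2047, which are stored as a lead byte 110xxxxx followed by a continuation byte 10xxxxxx. The function name utf8_encode_2byte is made up for this example and is not part of any library.

```c
/* Hypothetical helper (not a library function): encode a code point in the
   range 0x80..0x7FF as two UTF-8 bytes. Returns 0 on success, -1 if the
   code point is outside that range. */
int utf8_encode_2byte(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80 || cp > 0x7FF)
        return -1;
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx: upper 5 bits */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx: lower 6 bits */
    return 0;
}
```

    Calling utf8_encode_2byte(167, out) leaves 194 in out[0] and 167 in out[1], which are the two values the question's loop printed for §.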

    But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.

    You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call

    setlocale(LC_CTYPE, "");
    

    at the top of your program to synchronize your program's character set "locale" with that of your operating system.

    Not your original code, but I wrote this little program:

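    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    
    int main(void)
    {
        wint_t c;   /* wint_t, not wchar_t, so that WEOF can be represented */
        setlocale(LC_CTYPE, "");   /* adopt the environment's character encoding */
        while((c = fgetwc(stdin)) != WEOF)   /* fgetwc signals end of input with WEOF, not EOF */
            printf("%lc %ld\n", c, (long)c);   /* the character, then its code */
        return 0;
    }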
    

    When I type "A", it prints A 65. When I type "§", it prints § 167. When I type "Ƶ", it prints Ƶ 437. When I type "†", it prints † 8224.

    Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:

    1. Your original_message array is going to have to be an array of wchar_t, not an array of char.
    2. Your original_message array isn't going to be an ordinary C string — it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
    3. Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc.
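
    The three changes above might look roughly like this once applied. This is only a sketch: read_wide_message and print_wide_codes are hypothetical helper names, and the caller is assumed to have called setlocale(LC_CTYPE, "") and to have opened fp without doing any byte-oriented I/O on it first.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: read up to cap-1 wide characters from fp into buf
   and terminate the result with L'\0'. Returns the number stored. */
size_t read_wide_message(FILE *fp, wchar_t *buf, size_t cap)
{
    size_t n = 0;
    wint_t wc;   /* wint_t so that WEOF can be told apart from a character */
    if (cap == 0)
        return 0;
    while (n + 1 < cap && (wc = fgetwc(fp)) != WEOF)
        buf[n++] = (wchar_t)wc;
    buf[n] = L'\0';
    return n;
}

/* Hypothetical helper: print the numeric code of every wide character in
   message and return how many there were. */
size_t print_wide_codes(const wchar_t *message)
{
    size_t k = wcslen(message);   /* wcslen, not strlen */
    for (size_t i = 0; i < k; i++)
        printf("%ld ", (long)message[i]);
    printf("\n");
    return k;
}
```

    With this layout the buffer is declared as wchar_t original_message[...], and the message itself would be printed with printf("%ls\n", original_message) rather than %s.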

    So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
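
    A minimal sketch of that "UTF-8 everywhere" approach, assuming the input file is valid UTF-8: read bytes with fgetc, keep them in an ordinary char array, and count characters (code points) only when you really need to. read_message and utf8_code_points are hypothetical names used for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Read up to cap-1 bytes from fp into buf and NUL-terminate it.
   fgetc's result is kept in an int so EOF can be told apart from a
   real byte; the loop stops on EOF instead of polling feof(). */
size_t read_message(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;
    if (cap == 0)
        return 0;
    while (n + 1 < cap && (c = fgetc(fp)) != EOF)
        buf[n++] = (char)c;
    buf[n] = '\0';
    return n;
}

/* Count code points in a UTF-8 string by skipping continuation bytes,
   which always have the bit pattern 10xxxxxx. Assumes valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

    For the string "A§" stored as UTF-8, strlen reports 3 bytes while utf8_code_points reports 2. Writing the buffer back out with fputs or fwrite reproduces the file's bytes unchanged, so § survives the round trip without any wide-character functions.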