Tags: c, windows, utf-8, mingw, mingw-w64

How to input UTF-8 characters in MinGW-w64?


Platform: Windows x64 22H2

I have the following code (source file encoding: UTF-8):

#include <stdio.h>

int main(int argc, char **argv)
{
    static char text[8];
    scanf("%[^\n]s", text);
    printf("%s\n", text);
    return 0;
}

It works properly when only ASCII characters are input.
But when I input Chinese or other non-ASCII characters, nothing is read.

If non-ASCII characters are input, the content of the text array is: 00 00 00 00 00 00 00 00. I run the program in Windows CMD, and the compile command is: gcc main.c -o main.exe.

I tried adding locale support; this is the modified code:

#include <stdio.h>
#include <locale.h>

int main(int argc, char **argv)
{
    setlocale(LC_ALL, "zh_CN.UTF-8");
    static char text[8];
    scanf("%[^\n]s", text);
    printf("%s\n", text);
    return 0;
}

But the content of this array is still: 00 00 00 00 00 00 00 00.

I then tried changing the CMD code page to 65001 (chcp 65001), but the result was the same. I also tried adding the gcc command-line option -finput-charset=UTF-8, but it still didn't work.

But when I change the source file encoding to one of the GB series (such as GB2312) or change the CMD code page to 936, the program reads GB2312-encoded data normally, like this:

input: 你好
output: ce d2 b5 c4 00 00 00 00

So the program can read non-ASCII characters in GB2312 encoding, but not in UTF-8.
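
The bytes ce d2 b5 c4 shown above are not well-formed UTF-8, which can be checked mechanically. The following is a minimal sketch, not part of the original question: the function name looks_like_utf8 is hypothetical, and the check is simplified (it tests lead and continuation byte patterns but does not reject surrogates or every overlong form).

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n bytes at s follow the UTF-8 lead/continuation
   byte patterns. Simplified sketch: surrogates and some overlong
   3- and 4-byte forms are NOT rejected. */
bool looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t extra;                           /* continuation bytes expected */

        if (b < 0x80)
            extra = 0;                          /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF)
            extra = 1;                          /* 2-byte sequence */
        else if (b >= 0xE0 && b <= 0xEF)
            extra = 2;                          /* 3-byte sequence */
        else if (b >= 0xF0 && b <= 0xF4)
            extra = 3;                          /* 4-byte sequence */
        else
            return false;                       /* stray continuation or invalid lead */

        if (extra > n - i - 1)
            return false;                       /* sequence cut off at end of buffer */

        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                   /* not a 10xxxxxx continuation byte */

        i += extra + 1;
    }
    return true;
}
```

For the GB2312 bytes, ce is a 2-byte lead but d2 is not a continuation byte, so the function returns false. For UTF-8 input such as e5 bf ab it returns true.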


Solution

  • In a bash shell with the locale set to LANG=en_US.UTF-8, the following program correctly reads a UTF-8 string:

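    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char text[100];

        /* %99s reads one whitespace-delimited token and leaves room for the NUL. */
        if (scanf("%99s", text) != 1)
            return 1;
        printf("%s\n", text);

        /* Dump the raw bytes so the encoding can be checked. */
        size_t len = strlen(text);
        for (size_t i = 0; i < len; i++)
            printf(" %02x", (unsigned char)text[i]);
        printf("\n");
        return 0;
    }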
    
    
    快速的棕色狐狸
    快速的棕色狐狸
     e5 bf ab e9 80 9f e7 9a 84 e6 a3 95 e8 89 b2 e7 8b 90 e7 8b b8
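
The hex dump above shows 21 bytes for 7 characters, which matters when sizing the input buffer. As a rough sketch (the helper name utf8_count is hypothetical, and the input is assumed to be valid UTF-8), the number of characters can be counted by skipping continuation bytes:

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated string that is ASSUMED to be
   valid UTF-8: every byte that is not a continuation byte (10xxxxxx)
   starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

Applied to the 21 bytes printed above, this returns 7, while strlen returns 21. By the same arithmetic, the 8-byte text array in the question can hold at most two 3-byte characters plus the terminating NUL.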