Tags: c++, character-encoding, character, wchar-t, wchar

C++ - How to store a Chinese character in a char or similar?


I am trying to store a Chinese character in a variable of type wchar_t and print out this character. However, the program incorrectly prints a ?. Here is my code.

#include <iostream>

int main() {
    using std::wcout;
    using std::endl;
    wchar_t c = L'人'; // 人 is a Chinese character
    wcout << c << endl;
}
Output: ?

Changing wchar_t to char will cause the program not to compile.

P.S. The encoding of my terminal is UTF-8.


Solution

  • wchar_t is outdated

    You may have noticed that there are two versions of the Windows API: the A version and the W version.

    The W APIs take wchar_t strings as their parameters. On Windows a wchar_t is 2 bytes long, and a wchar_t string is encoded as UTF-16LE: a sequence of 16-bit code units stored in little-endian byte order. (Older versions of Windows used UCS-2, which stores every character in exactly one 16-bit unit.)

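    But a single 2-byte unit can only hold 2^16 (65536) distinct values, so one wchar_t on Windows cannot represent every Unicode character. In UTF-16, characters outside the Basic Multilingual Plane take two units (a surrogate pair).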

    See this answer

    Note that the size of wchar_t is implementation-defined and varies among platforms. For example, on Linux it is 4 bytes long. If you are writing cross-platform applications, it is a bad idea to have wchar_t in your code.

    What to do

    So back to the question, how can we store such characters in your program?

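    First, a Chinese character does not fit in a single char; it has to be stored as a string. 人 (U+4EBA) takes 3 bytes in UTF-8 and 2 bytes in UTF-16. Rarer Chinese characters outside the Basic Multilingual Plane take 4 bytes in both encodings.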

    So you should do it this way:

    #include <iostream>
    int main(){
        using std::endl;
        using std::cout;
    
        char c[]{ "人" };
        cout<<c<<endl;
    }
    

    c is declared as char[] so it can hold a string.

    Note that there is no = in the definition of c. This brace initialization is C++-only syntax (C++11 and later). If you are writing C, you should instead write:

    char c[] = "人";
    

    But the char[] version may still FAIL to print the character! Why?

    By default, the Windows console uses code page 936 (GBK) on zh-CN systems. MSVC appears to use that code page too, both when reading source files that have no BOM and for the strings it emits (I am not sure about this; it needs testing). So if your source file, your compiler, and your console all use GBK, you will get the right output.

    But if any one of these encodings does not match the others, your program will still fail.

    Printing non-ASCII strings to the console is generally not a good idea, especially on Windows. You can write them to a file, send them over the network, show them in a Win32 GUI, or even write your own character rendering engine. Don't rely on the console too much.

    Trivia

    chcp sets the code page (encoding) of the Windows console. For example:

    :: GBK
    chcp 936
    :: UTF-8
    chcp 65001