c++ linux chinese-locale wstring

Reading a file that contains Chinese characters (C++)


I'm having trouble reading a file that contains Chinese characters. I know that the file is encoded in Big5.

Here is my example file (test.txt). I can't include it inline because of the Chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41

This is my minimal code example (main.cpp). The program I'm actually using splits each line and processes the individual fields.

#include <string>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "Chinese-traditional");
    std::wstring wstr;
    std::wifstream input_file("test.txt");
    std::wofstream output_file("test_output.txt");
    int counter = 0;
    while(std::getline(input_file, wstr)) {
        for(int i = 0; i < wstr.size(); i++) {
            if(wstr[i] == L'|') {
                counter++;
            }
        }
        output_file << wstr << std::endl;
    }
    input_file.close();
    output_file.close();
    std::cout << counter << std::endl;
    return 0;
}

To compile my program:

g++ -o test main.cpp -std=c++17

On Windows 10 I get the expected output: the entire file is copied to "test_output.txt" and 129 is printed in the terminal.

On Linux (Debian 9) the terminal output is 4, and "test_output.txt" contains only the first line and the "1|" from the second.

Here is what I tried:

My first guess was a CR LF versus LF line-ending issue between Windows and Linux, but testing the file with both CR LF and LF did not help.

Then I thought that "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but still did not get the expected result.


Solution

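  • First check that the locale you need is installed. "Chinese-traditional" is a Windows locale name; on Linux the Traditional Chinese locale is zh_TW.UTF-8. Note that the locale's encoding has to match the file's: zh_TW.UTF-8 decodes UTF-8, so if the file really is Big5 you would need the Big5 variant (zh_TW.BIG5) instead. You can list the installed locales with locale -a. If yours is not listed, install it (on Debian you may first have to enable it in /etc/locale.gen or through sudo dpkg-reconfigure locales):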

    sudo locale-gen zh_TW.UTF-8
    sudo update-locale
    

    (There's a list of locales here with their names on Linux and Windows.)

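    Then call imbue on the input and output streams to set their locale. setlocale only changes the C library's locale; the C++ file streams keep using the global C++ locale, which is the classic "C" locale by default and cannot decode multibyte Chinese characters. That is most likely why reading stops at the first Chinese character on Linux.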
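The locale check described above can also be done from within the program. The following is a minimal sketch, not part of the original answer: the helper names locale_available and pick_locale_name are hypothetical, and it relies only on the fact that the std::locale constructor throws std::runtime_error when the named locale cannot be created.

```cpp
#include <initializer_list>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if the C++ runtime can construct the named locale.
bool locale_available(const char* name) {
    try {
        std::locale loc(name);  // throws std::runtime_error if the name is not valid
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Hypothetical helper: returns the first candidate name that can be
// constructed, or an empty string if none of them is available.
std::string pick_locale_name(std::initializer_list<const char*> candidates) {
    for (const char* name : candidates) {
        if (locale_available(name)) {
            return name;
        }
    }
    return std::string();
}
```

A caller could, for example, try pick_locale_name({"zh_TW.BIG5", "zh_TW.UTF-8"}) and print a clear error message when the result is empty, instead of letting the std::locale constructor terminate the program with an uncaught exception.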

    By default, std::wcout is synchronized with the underlying C stream stdout, which uses an ASCII mapping and displays ? in place of any Unicode character it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off and then imbue std::wcout with the locale:

    std::ios_base::sync_with_stdio(false);
    std::wcout.imbue(loc);
    

    Amended version of your code:

    #include <string>
    #include <locale>
    #include <fstream>
    #include <iostream>
    
    int main(int argc, char* argv[])
    {
        // Throws std::runtime_error if the locale is not installed
        auto loc = std::locale("zh_TW.utf8");
    
        // Disable synchronisation with stdio and set the locale of std::wcout
        std::ios::sync_with_stdio(false);
        std::wcout.imbue(loc);
    
        // Set the locale of the input stream
        std::wstring wstr;
        std::wifstream input_file("test.txt");
        input_file.imbue(loc);
    
        // Set the locale of the output stream
        std::wofstream output_file("test_output.txt");
        output_file.imbue(loc);
    
        int counter = 0;
        while(std::getline(input_file, wstr)) {
            // Count the '|' field separators in this line
            for(std::size_t i = 0; i < wstr.size(); i++) {
                if(wstr[i] == L'|') {
                    counter++;
                }
            }
            std::wcout << wstr << std::endl;
            output_file << wstr << std::endl;
        }
        input_file.close();
        output_file.close();
        std::wcout << counter << std::endl;
        return 0;
    }