fstream not working properly with Russian text?


I work with Russian a lot and I've been trying to read data from a file with an input stream. Here's the code; it's supposed to output only the words that contain no more than 5 characters.

#include <iostream>
#include <fstream>
#include <string>
#include <Windows.h>
using namespace std;
int main()
{
    setlocale(LC_ALL, "ru_ru.utf8");
    ifstream input{ "in_text.txt" };
    if (!input) {
        cerr << "Ошибка при открытии файла" << endl;
        return 1;
    }
    cout << "Вывод содержимого файла: " << "\n\n";
    string line{};
    while (input >> line) {
        if (line.size() <= 5)
            cout << line << endl;
    }
    cout << endl;

    input.close();
    return 0;
}

Here's the problem:

I noticed the output didn't pick up all of the words that actually contain no more than 5 characters. So I ran a simple test with the word "Test" in English and its translation "тест" in Russian, which has the same number of characters. My text file looks like this:

Test тест

I used the debugger to see how the program runs; it printed the English word and skipped the Russian one. I can't understand why this is happening.

P.S. When I changed the condition to if (line.size() <= 8), it printed both of them. Very odd.

I think I messed up my system locale somehow, I don't know. I did once try to use std::locale without really understanding it; maybe that did something to my PC, I'm not really sure. Please help.


Solution

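  • line.size() counts bytes, not characters. Assuming the file is UTF-8 encoded, each Cyrillic letter takes two bytes, so "тест" has a size of 8 while "Test" has a size of 4, which is why both words appeared once the limit was raised to 8. I'm not sure this is the best approach, but converting each word to UTF-32 with codecvt_utf8 and wstring_convert and checking that length seems to work: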

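    #include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
    #include <fstream>
    #include <string>
    #include <iostream>
    #include <locale>    // std::wstring_convert (deprecated since C++17)
    
    int main() {
        // ... (open `input` and declare `line` as in the question)
    
        // converter from UTF-8 bytes to UTF-32 code points, created once and reused
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> u8_to_u32;
    
        while (input >> line) {
            // convert the UTF-8 encoded `line` to UTF-32
            // (from_bytes throws std::range_error if `line` is not valid UTF-8):
            std::u32string u32s = u8_to_u32.from_bytes(line);
    
            if (u32s.size() <= 5)           // check the UTF-32 length (code points)
                std::cout << line << '\n';  // but print the UTF-8 encoded string
        }
    
        // ...
    }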
    
