
Reading and writing files in Cyrillic in C++


I have to read a file containing Cyrillic text, then pick a random number of lines at random and write the modified text to a different file. This works fine with Latin letters, but with Cyrillic text I get garbage. This is what I tried.

Say, file input.txt is

ааааааа
ббббббб
ввввввв

I have to read it, and put every line into a vector:

vector<wstring> inputVector;
wstring inputString, result;
wifstream inputStream;
inputStream.open("input.txt");
while(!inputStream.eof())
{
    getline(inputStream, inputString);              
    inputVector.push_back(inputString);
}
inputStream.close();    

srand(time(NULL));
int numLines = rand() % inputVector.size();
for(int i = 0; i < numLines; i++)
{
    int randomLine = rand() % inputVector.size();
    result += inputVector[randomLine];
}

wofstream resultStream;
resultStream.open("result.txt");
resultStream << result;
resultStream.close();

So how can I work with Cyrillic text so that the output is readable, not just random symbols?


Solution

  • Because you saw something like ■a a a a a a a 1♦1♦1♦1♦1♦1♦1♦ 2♦2♦2♦2♦2♦2♦2♦ printed to the console, it appears that input.txt is encoded in UTF-16, probably UTF-16 LE with a BOM. Your original code will work if you change the encoding of the file to UTF-8.

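    The reason for using UTF-8 is that, regardless of the character type of the file stream, basic_fstream's underlying basic_filebuf treats the file itself as a sequence of char objects and uses a codecvt object to convert between that sequence and the stream's own character type. When reading, the char sequence read from the file is converted to a wchar_t sequence; when writing, the wchar_t sequence is converted to a char sequence that is then written to the file. In the case of std::wifstream, the codecvt object is an instance of the standard std::codecvt<wchar_t, char, mbstate_t>, which converts between a narrow multibyte encoding (generally UTF-8 or something byte-compatible with it, depending on the implementation and locale) and the wide encoding (typically UTF-16 on Windows and UTF-32 elsewhere). It does not decode UTF-16 input, which is why a UTF-16 file comes out as garbage.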

    As explained on the MSDN documentation page for basic_filebuf:

    Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer.

    Similarly, when reading a Unicode string (containing wchar_t characters), the basic_filebuf converts the ANSI string read from the file to the wchar_t string returned to getline and other read operations.

    If you change the encoding of input.txt to UTF-8, your original program should work correctly.
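
    One quick way to confirm the diagnosis is to look at the first bytes of input.txt for a byte order mark (BOM). The following is a minimal sketch and not part of the original program: detectBom is a hypothetical helper name, and the file is assumed to be in the working directory. A file without a BOM may still be UTF-8 or a legacy code page, so "no BOM" proves nothing on its own.

    ```cpp
    #include <fstream>
    #include <iostream>
    #include <string>

    // Classify a file by its byte order mark, if it has one.
    // Simplification: a UTF-32 LE BOM (FF FE 00 00) is reported as UTF-16 LE.
    std::string detectBom(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary);
        if (!in)
        {
            return "cannot open file";
        }

        unsigned char b[3] = {0, 0, 0};
        in.read(reinterpret_cast<char*>(b), 3);
        const std::streamsize n = in.gcount();  // bytes actually read

        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 with BOM";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return "no BOM";
    }

    int main()
    {
        std::cout << detectBom("input.txt") << '\n';
        return 0;
    }
    ```

    If this reports UTF-16 LE, re-saving the file as UTF-8 in a text editor is the change suggested above.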

    For reference, this works for me:

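    #include <cstdlib>
    #include <ctime>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>
    
    int main()
    {
        using namespace std;
    
        // Read every line of input.txt into a vector.
        vector<wstring> inputVector;
        wstring inputString, result;
        wifstream inputStream;
        inputStream.open("input.txt");
        // Test getline itself rather than eof(), so a failed read is never
        // stored and a file that cannot be opened does not loop forever.
        while(getline(inputStream, inputString))
        {
            inputVector.push_back(inputString);
        }
        inputStream.close();
    
        // Nothing to pick from if the file was missing or empty.
        if(inputVector.empty())
        {
            return EXIT_FAILURE;
        }
    
        // Concatenate a random number of randomly chosen lines.
        srand(static_cast<unsigned>(time(NULL)));
        int numLines = rand() % inputVector.size();
        for(int i = 0; i < numLines; i++)
        {
            int randomLine = rand() % inputVector.size();
            result += inputVector[randomLine];
        }
    
        // Write the result; the stream converts wchar_t back to char.
        wofstream resultStream;
        resultStream.open("result.txt");
        resultStream << result;
        resultStream.close();
    
        return EXIT_SUCCESS;
    }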
    

    Note that the encoding of result.txt will also be UTF-8 (generally).