Failing to find a wchar_t that is present in a std::wstring


I was playing with std::wstring and std::wfstream when I encountered some strange behaviour: it appears that std::basic_string<wchar_t>::find fails to find certain characters. Consider the following code:

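#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::wifstream input("input.txt");
    std::wofstream output("output.txt");

    if(!(input && output)){
        std::cerr << "file(s) not opened";
        return -1;
    }

    // Read the first line of the input file and copy it to the output file.
    std::wstring buf;
    std::getline(input, buf);

    output << buf;

    // Unexpectedly prints npos instead of an index.
    std::cout << buf.find(L'ć');
}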

Here I am simply reading the first line of the input file and writing it to the output file. Before the program runs, the content of the first file is aąbcćd and the output file is empty. After executing the code, the input file is successfully copied into the output file.

The surprise came when I tried to find the letter ć in buf. After the program ran, I confirmed that the output file contains exactly aąbcćd, which obviously includes ć.

However, the line std::cout << buf.find(L'ć') did not behave as expected. I wasn't expecting to get an output of 4, given the memory layout of std::wstring, but I definitely did not expect to get std::wstring::npos either. It's worth mentioning that finding regular ASCII characters with this method succeeds.

To sum up, the code correctly copies the first line of the input file to the output file, but find fails (returning npos) on the very string that holds the copied data. Why is that so? What causes find to fail here?

Note: both of the files are UTF-8 encoded on Windows.


Solution

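  • Unfortunately wchar_t isn't UTF-8; on Windows it holds UTF-16 code units, and no magic conversion happens when you read a UTF-8 file through a wide stream. The two bytes that encode ć in the file are never decoded into the single wchar_t that L'ć' denotes, so find has nothing to match. Writing the buffer back out applies the same mapping in reverse, which would explain why the copied file still looks correct. If you debug your program you'll see corrupted characters in your buf variable.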
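
    The mismatch can be reproduced without any files. The sketch below assumes the wide stream widens every byte of the file to its own wchar_t, which is consistent with the behaviour described in the question but may differ on other platforms or locales. The names widen_bytes and utf8_line are hypothetical and exist only for this illustration.

    ```cpp
    #include <cassert>
    #include <string>

    // Hypothetical helper: mimics a wide stream that performs no UTF-8
    // decoding, so every byte of the input becomes one wchar_t.
    std::wstring widen_bytes(const std::string& bytes)
    {
        std::wstring wide;
        for (unsigned char b : bytes)
            wide.push_back(static_cast<wchar_t>(b));
        // One wchar_t per byte, not one per character.
        assert(wide.size() == bytes.size());
        return wide;
    }

    // UTF-8 bytes of "aąbcćd", spelled out as escapes so the result does
    // not depend on the encoding of this source file. The literal is split
    // so that the hex escapes do not swallow the following letters.
    const std::string utf8_line = "a\xC4\x85" "bc\xC4\x87" "d";
    ```

    With these definitions, widen_bytes(utf8_line).find(L'\u0107') returns std::wstring::npos, because the buffer holds the two separate values 0xC4 and 0x87 where ć (U+0107) should be, while searching for L'd' still succeeds.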

    You either need to read your string as a std::string and then convert it from UTF-8 to wchar_t, or work in UTF-8 throughout and search for a UTF-8 encoded std::string instead of a wchar_t literal.

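    If your compiler supports C++11, you can use the following to create a UTF-8 string literal (note that from C++20 on, u8 literals have a char8_t element type and no longer convert to std::string directly):

    u8"ć"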

    The following should work:

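    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        // Narrow streams pass the UTF-8 bytes through unchanged.
        std::ifstream input("input.txt");
        std::ofstream output("output.txt");

        if(!(input && output)){
            std::cerr << "file(s) not opened";
            return -1;
        }

        std::string buf;
        std::getline(input, buf);

        output << buf;

        // find returns a byte offset, not a character index: 5 for aąbcćd
        // when the file has no byte order mark. u8"ć" is char-based up to
        // C++17; with C++20 spell the bytes out as "\xC4\x87" instead.
        std::cout << buf.find(u8"ć");
    }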
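
The other option mentioned in the answer, reading the bytes into a std::string and converting them to wchar_t afterwards, could look roughly like the sketch below. It is a minimal hand-written decoder and not a library API; the names append_code_point and utf8_to_wstring are hypothetical. On Windows the existing MultiByteToWideChar function with CP_UTF8 performs this conversion, and std::wstring_convert also does so but has been deprecated since C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: appends one Unicode code point to a wstring,
// using a surrogate pair when wchar_t is 16 bits wide (as on Windows).
void append_code_point(std::wstring& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<wchar_t>(cp));
    }
}

// Hypothetical helper: decodes UTF-8 bytes into a wstring.
// Throws std::invalid_argument on truncated or malformed sequences.
// It does not reject overlong encodings or encoded surrogates.
std::wstring utf8_to_wstring(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        append_code_point(out, cp);
        i += len;
    }
    return out;
}
```

Applied to the line read with std::getline into a std::string, utf8_to_wstring(buf).find(L'ć') then yields the character index 4 that the question was originally looking for, provided the compiler interprets the source file's encoding correctly and the file does not start with a byte order mark.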