Tags: c++, string, unicode, utf, widechar

std::string is natively encoded in UTF-8 but char cannot hold UTF-8 characters?


After reading std::wstring VS std::string, I was under the impression that for Linux, I don't need to worry about using any wide character facilities of the language.
*things like: std::wifstream, std::wofstream, std::wstring, wchar_t, etc.

This seems to go fine when I'm using only std::strings for the non-ASCII characters, but not when I'm using chars to handle them.

For example: I have a file with just a Unicode checkmark in it.
I can read it in, print it to the terminal, and output it to a file.

// ✓ reads in unicode to string
// ✓ outputs unicode to terminal
// ✓ outputs unicode back to the file
#include <iostream>
#include <string>
#include <fstream>

int main(){
  std::ifstream in("in.txt");
  std::ofstream out("out.txt");

  std::string checkmark;
  std::getline(in,checkmark); //size of string is actually 3 even though it just has 1 unicode character

  std::cout << checkmark << std::endl;
  out << checkmark;

}

However, the same program does not work if I use a char in place of the std::string:

// ✕ only partially reads in unicode to char
// ✕ does not output unicode to terminal
// ✕ does not output unicode back to the file
#include <iostream>
#include <string>
#include <fstream>

int main(){
  std::ifstream in("in.txt");
  std::ofstream out("out.txt");

  char checkmark;
  checkmark = in.get();

  std::cout << checkmark << std::endl;
  out << checkmark;

}

Nothing appears in the terminal (apart from a newline).
The output file contains â instead of the checkmark character.

Since a char is only one byte, I tried using a wchar_t instead, but it still does not work:

// ✕ only partially reads in unicode to wchar_t
// ✕ does not output unicode to terminal
// ✕ does not output unicode back to the file
#include <iostream>
#include <string>
#include <fstream>

int main(){
  std::wifstream in("in.txt");
  std::wofstream out("out.txt");

  wchar_t checkmark;
  checkmark = in.get();

  std::wcout << checkmark << std::endl;
  out << checkmark;

}

I've also read about setting the following locale, but it does not appear to make a difference.

setlocale(LC_ALL, "");

Solution

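  • std::string is not inherently UTF-8; it is a sequence of bytes that are passed through unchanged. In the std::string case you read one whole line, which in your case contains a single multi-byte character: in UTF-8 a checkmark such as U+2713 is encoded as three bytes (E2 9C 93), which is why the string's size is 3. In the char case you read a single byte, which is only the first byte of that character and not a complete character on its own. That byte, 0xE2, is â in Latin-1, which explains the â in your output file.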

    Edit: for UTF-8, read into an array of char, or simply use std::string, since that already works.