c utf-8 windows-console

UTF8 console output: MultiByteToWideChar vs mbsrtowcs


I want to read a short line from a UTF-8 file and display it in the Windows console.

I succeeded with the WinAPI function MultiByteToWideChar:

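#include <stdio.h>
#include <locale.h>
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16.
   The caller must provide an output buffer that is large enough. */
void mbtowchar(const char* input, WCHAR* output) {
  /* First call: ask how many wide characters are needed (including the NUL). */
  int len = MultiByteToWideChar(CP_UTF8, 0, input, -1, NULL, 0);
  /* Second call: do the actual conversion. */
  MultiByteToWideChar(CP_UTF8, 0, input, -1, output, len);
}

int main(void) {
  setlocale(LC_ALL,"");
  char in[256];
  WCHAR out[256];  /* UTF-16 never needs more units than the UTF-8 input has bytes */

  FILE* file = fopen("data.txt", "r");
  if (file == NULL) return 1;
  if (fgets(in, sizeof in, file) == NULL) in[0] = '\0';
  fclose(file);

  mbtowchar(in, out);
  printf("%ls",out);
  return 0;
}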

...but I failed with the ISO C function mbsrtowcs (non-ASCII characters come out garbled):

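#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void) {
  setlocale(LC_ALL,"");  /* selects the system default locale and its encoding */
  char in[256];
  wchar_t out[256];

  FILE* file = fopen("data.txt", "r");
  if (file == NULL) return 1;
  if (fgets(in, sizeof in, file) == NULL) in[0] = '\0';
  fclose(file);

  const char* p = in;
  mbstate_t mbs;
  memset(&mbs, 0, sizeof mbs);  /* portable way to get the initial conversion state */
  /* Interprets the bytes in the locale's multibyte encoding, not necessarily UTF-8. */
  mbsrtowcs(out, &p, 255, &mbs);

  printf("%ls",out);
  return 0;
}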

Am I doing something wrong with mbsrtowcs, or is there some important difference between these two functions? Is it possible to reliably print UTF-8 in the Windows console using only ISO C functions? (Assume a matching console font is installed.)

Notes: I use the MinGW gcc compiler. C++ is a last-resort solution for me; I'd like to stay with C.


Solution

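  • What's "wrong" with mbsrtowcs is that it converts from the multibyte encoding of the current C locale (stored in char) to "wide" characters (wchar_t); it makes no assumption that the input is UTF-8. Wide characters are today understood as Unicode (on Windows, 16-bit UTF-16 code units), but "multi-byte" does not necessarily imply UTF-8. After setlocale(LC_ALL,""), the Windows C runtime uses the system's ANSI code page, which is either a legacy single-byte code page such as Windows-1252 or one of the pre-Unicode double-byte encodings of East Asian scripts, so your UTF-8 bytes are decoded as if they were written in that code page. Frustratingly, the msvcrt.dll runtime that MinGW traditionally links against doesn't support UTF-8 as a locale encoding at all. (Microsoft's newer Universal CRT does accept a UTF-8 locale such as ".UTF8" on Windows 10 version 1803 and later, but that doesn't help with a classic MinGW setup.)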

    Thus attempts to use mbsrtowcs to interpret UTF-8 are doomed to fail in this setup. You will have to use MultiByteToWideChar, as your first snippet does, or switch to some other means of converting UTF-8 to UTF-16. (Since UTF-8 and UTF-16 both encode Unicode code points, you could even write a simple routine of your own to do that, if your goal is to avoid depending on proprietary extensions.)
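
    For what it's worth, here is a minimal sketch of such a hand-written routine in plain ISO C99. The name utf8_to_utf16 and its error convention (returning (size_t)-1) are my own choices for illustration, not part of any standard library, and the code has not been tuned or hardened beyond basic validation. It decodes each UTF-8 sequence to a code point, rejects malformed, overlong and surrogate input, and emits one UTF-16 unit or a surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a NUL-terminated UTF-8 string to UTF-16 code units.
 * (Illustrative helper; the name and error convention are assumptions.)
 * `cap` is the capacity of `out` in units, including the terminating 0.
 * Returns the number of units written, excluding the terminator,
 * or (size_t)-1 on malformed input or insufficient space.
 */
size_t utf8_to_utf16(const char *in, uint16_t *out, size_t cap)
{
    /* Smallest code point that may legally use 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s) {
        uint32_t cp;
        int extra;  /* number of continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            /* A NUL here means the sequence was truncated; it fails this test too. */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*s & 0x3F);
        }

        /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return (size_t)-1;  /* keep room for the terminator */
            out[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        }
    }

    if (n >= cap) return (size_t)-1;
    out[n] = 0;
    return n;
}
```

    On Windows, where wchar_t is 16 bits wide, the resulting buffer should be usable wherever a wide string is expected (for example, after a cast, with WriteConsoleW), but you would still need some wide-character output call to get it onto the console.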