I want to read a short line from a UTF-8 file and display it in the Windows console. I succeeded with the MultiByteToWideChar Winapi function:
#include <stdio.h>
#include <locale.h>
#include <windows.h>

void mbtowchar(const char* input, WCHAR* output) {
    /* first call computes the required length, second call converts */
    int len = MultiByteToWideChar(CP_UTF8, 0, input, -1, NULL, 0);
    MultiByteToWideChar(CP_UTF8, 0, input, -1, output, len);
}

int main(void) {
    setlocale(LC_ALL, "");
    char in[256];
    WCHAR out[256];
    FILE* file = fopen("data.txt", "r");
    fgets(in, 255, file);
    fclose(file);
    mbtowchar(in, out);
    printf("%ls", out);
    return 0;
}
...but I failed with the ISO mbsrtowcs function (non-ASCII chars come out garbled):
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");
    char in[256];
    wchar_t out[256];
    FILE* file = fopen("data.txt", "r");
    fgets(in, 255, file);
    fclose(file);
    const char* p = in;
    mbstate_t mbs = {0};  /* zero-initialize the conversion state */
    mbsrtowcs(out, &p, 255, &mbs);
    printf("%ls", out);
    return 0;
}
Am I doing something wrong with mbsrtowcs, or is there some important difference between these two functions? Is it possible to reliably print UTF-8 in the Windows console using ISO functions? (Assuming a matching console font is installed.)
Notes: I use the MinGW gcc compiler. C++ is a last-resort solution for me; I'd like to stay with C.
What's "wrong" with mbsrtowcs
is that it converts from a system-defined variable-width encoding of 8-bit characters (char
) to a fixed-width array of "wide" characters (wchar_t
). Wide characters are today understood as Unicode code points, but "multi-byte" does not necessarily imply UTF-8. On Windows it in fact refers to various pre-Unicode encodings of Asian scripts. Frustratingly, Windows doesn't support UTF-8 as a native "multi-byte" encoding at all, and apparently never will.
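You can see the failure mode directly: after setlocale(LC_ALL, ""), the C locale on Windows picks up the system ANSI code page, so mbsrtowcs decodes each byte according to that code page rather than as UTF-8. A minimal sketch, assuming a Western system whose ANSI code page is 1252:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");          /* picks up the ANSI code page, e.g. 1252 */
    const char* utf8 = "\xC3\xA9";  /* UTF-8 encoding of U+00E9 ('é') */
    wchar_t out[8];
    const char* p = utf8;
    mbstate_t mbs = {0};
    mbsrtowcs(out, &p, 8, &mbs);
    /* Under code page 1252 the two bytes decode as two separate
       characters, U+00C3 ('Ã') and U+00A9 ('©'), not as U+00E9. */
    printf("%ls\n", out);
    return 0;
}

This is exactly the "messed" non-ASCII output you observed: each UTF-8 byte is mapped to a code point on its own.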
Thus attempts to use mbsrtowcs to interpret UTF-8 are doomed to fail on Win32. You will have to use MultiByteToWideChar, as your first snippet does, or switch to some other means of converting UTF-8 to UTF-16. (Since UTF-8 and UTF-16 both encode UCS code points, you could even write a simple routine of your own to do that, if your goal is to avoid depending on proprietary extensions.)
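For illustration, here is a minimal sketch of such a routine. The name utf8_to_utf16 and the error convention are mine, not part of any standard API, and it only checks basic byte structure; it does not reject overlong encodings or other ill-formed input:

#include <stddef.h>
#include <wchar.h>

/* Decode a NUL-terminated UTF-8 string into UTF-16 (wchar_t is 16 bits on
   Windows). Returns the number of wchar_t written, excluding the
   terminator, or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf16(const char* src, wchar_t* dst, size_t dst_len) {
    const unsigned char* s = (const unsigned char*)src;
    size_t n = 0;
    while (*s) {
        unsigned long cp;
        int extra;
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* invalid lead byte */
        s++;
        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80) return (size_t)-1;  /* bad trail byte */
            cp = (cp << 6) | (*s & 0x3F);
        }
        if (cp < 0x10000) {           /* fits in a single UTF-16 unit */
            if (n + 1 >= dst_len) return (size_t)-1;
            dst[n++] = (wchar_t)cp;
        } else {                      /* encode as a surrogate pair */
            if (n + 2 >= dst_len) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (wchar_t)(0xD800 | (cp >> 10));
            dst[n++] = (wchar_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    dst[n] = 0;
    return n;
}

Your first program could then call utf8_to_utf16(in, out, 256) in place of mbtowchar and drop the dependency on windows.h entirely.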