Search code examples
c++visual-c++consolelocalembcs

Why printf can display non-ASCII characters when "C" locale is used?


Note: I'm asking an implementation defined behavior which is on Microsoft Visual C++ 2008(possibly the same on 2005+). OS: simplified Chinese installation of Win7.

It surprises me when I'm performing non-ASCII I/O w/ printf. E.g.

   // This won't be necessary as it's the system default code page.
   //system("chcp 936");
   
   // NULL to show current locale, which is "C"
   printf ("%s\n", setlocale(LC_ALL, NULL));
   printf ("中\n");
   printf ("%s\n", setlocale(LC_ALL, "English"));
   printf ("中\n");

Output:

Active code page: 936
C
中
English_United States.1252
?D

The memory footprint in debugger shows that "中" is encoded in two bytes: 0xD6, 0xD0, which is the code point of that character in code page 936, for simplified Chinese. It shouldn't be in the code point range of "C" locale which, most likely, is 0x0 ~ 0x7F.

Question:

Why can it still display the character correctly in "C" locale? So I made a guess that locale had no bearing on printf? But then, I shall ask, why can't it display anymore when changing to "English" locale, which is also different from 936? Interesting?

Edit:

I redirected the standard output to a file and took some test. It shows that whatever locale is set, the correct character "中" is saved in the file. It suggests that setlocale() is connected to the way console displays the character, which contradicts my understanding of how it works: printf puts the bytes/code points into input buffer of console, which interprets these bytes using its own code page(what chcp returns).


Solution

  • OK. For the default "C" locale, CRT assumes that characters passed to printf don't need any conversion. It has a reason because the ASCII characters almost always fall into the basic character set of the execution system(shared among different Windows code pages). When switched to "English", it assumes the input is encoded in code page 1252, and thus tries to perform a conversion from "English" to "Chinese", which is the locale used by the console. But CRT just cannot find the character in code page 1252. That's why it outputs a question mark.

    When redirected to a file, CRT knows it and won't do the conversion, because the console code page is no longer used. It just passes through the bytes as-is. How those bytes are interpreted is up to the program you use(e.g., care about BOM or not) when you open the file.

    Refer to this MSDN forum link: Why printf can display non-ASCII characters when “C” locale is used?