I'm trying to understand some behavior I'm seeing.
I have this C++ program:
// Outputter.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
int main()
{
// UTF-8 bytes for "日本語"
std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
return 0;
}
If I run the following in Powershell:
[System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
.\print_it.exe # This is the above program ^
日本語 # This is the output as displayed in Powershell
Then 日本語 is printed and displayed correctly in Powershell.
However, if I add setlocale(LC_ALL, "English_United States.1252"); to the code, like this:
int main()
{
setlocale(LC_ALL, "English_United States.1252");
// UTF-8 bytes for "日本語"
std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
return 0;
}
The program now prints garbage to Powershell (æ—¥æœ¬èªž to be precise, which is the code page 1252 misinterpretation of those bytes).
BUT if I pipe the output to a file and then cat the file, it looks fine:
.\print_it.exe > out.txt
cat out.txt
日本語 # It displays fine, like this, if I redirect to a file and cat the file.
Also, Git Bash displays the output properly no matter what I pass to setlocale.
Could someone please help me understand why setlocale is affecting how the output is displayed in Powershell, even though the same bytes are being written to stdout? It seems like Powershell is somehow able to access the locale of the program and uses that to interpret output?
Powershell version is 5.1.17763.592.
It is all about encoding. You get the correct characters with the > redirect because > writes UTF-16LE by default, so your output in the 1252 encoding is automatically converted to UTF-16.
Whether you can change the encoding of the redirect depends on your PowerShell version.
If you use Out-File with the -Encoding switch, you can change the encoding of the destination file (again, depending on your PowerShell version).
I recommend reading mklement0's excellent Stack Overflow post on this topic here.
Taken from cppreference
std::setlocale (C++ Localizations library)
Defined in header <clocale>
char* setlocale( int category, const char* locale);
The setlocale function installs the specified system locale or its portion as the new C locale. The modifications remain in effect and influences the execution of all locale-sensitive C library functions until the next call to setlocale. If locale is a null pointer, setlocale queries the current C locale without modifying it.
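The query and install behavior quoted above can be sketched in a few lines. This is a minimal illustration, not part of the original program; the helper names current_c_locale and try_set_locale are made up for this example. It relies only on two documented facts: a null pointer makes setlocale a pure query, and setlocale returns a null pointer when the requested locale cannot be installed.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: return the name of the current C locale without
// changing it. Passing a null pointer turns setlocale into a pure query.
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string();
}

// Hypothetical helper: try to install a locale for all categories.
// setlocale returns a null pointer if the name is not available on this
// system, in which case the current locale is left unchanged.
bool try_set_locale(const char* name) {
    return std::setlocale(LC_ALL, name) != nullptr;
}
```

A program that never calls setlocale starts in the "C" locale, so current_c_locale() returns "C" at startup. A name such as "English_United States.1252" is Windows-specific, so try_set_locale will likely return false for it on other platforms.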
The bytes you are sending to std::cout are the same, but std::cout is locale-sensitive, so the locale takes precedence over your PowerShell UTF-8 settings. If you leave out the setlocale() call, std::cout obeys the shell encoding.
If you have PowerShell 5.1 or later, > is effectively an alias for Out-File. You can set the encoding via $PSDefaultParameterValues like this:
$PSDefaultParameterValues['Out-File:Encoding'] = 'UTF8'
Then you would get a UTF-8 file (with a BOM, which can be annoying!) instead of the default UTF-16LE.
PowerShell uses the OEM code page, so by default you get whatever is set up in your Windows installation. I recommend reading an excellent post on encoding on Windows. The point is that, without your UTF-8 setting in PowerShell, you are on whatever code page your system has.
output.exe sets the locale to English_United States.1252 within the C++ program, and output_original.exe does not change it:
Here is the output without the UTF8 PowerShell setting:
c:\t>.\output.exe
æ-¥æo¬èªz --> nonsense within the win1252 code page
c:\t>.\output.exe | hexdump
0000000 97e6 e6a5 ac9c aae8 009e --> both hex outputs are the same!
0000009
c:\t>.\output_original.exe
日本語 --> nonsense, but a different one! (depends on your locale setup; mine was English)
c:\t>.\output_original.exe | hexdump
0000000 97e6 e6a5 ac9c aae8 009e --> both hex outputs are the same!
0000009
So what happens here? How your program's output is displayed depends on either the locale set in the program itself or the Windows one (which is OEM code page 1252 on my virtual machine). Notice that in both versions the hexdump is the same, but the displayed output (after decoding) is not.
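To see that the hexdump lines above really are the nine UTF-8 bytes from the program, it helps to know that plain hexdump prints 16-bit words in little-endian order, so each pair of bytes appears swapped and an odd trailing byte is padded with 00. The following is a minimal sketch of that formatting. The function name hexdump_words is made up for this example, and it assumes a little-endian machine and hexdump's default format.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: format bytes as space-separated 16-bit little-endian
// words, mimicking the data columns of default `hexdump` output
// (the offset column is omitted).
std::string hexdump_words(const std::vector<unsigned char>& bytes) {
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        unsigned lo = bytes[i];
        // An odd trailing byte is paired with a zero byte.
        unsigned hi = (i + 1 < bytes.size()) ? bytes[i + 1] : 0u;
        char word[8];
        // The second byte of each pair is printed first.
        std::snprintf(word, sizeof word, "%02x%02x", hi, lo);
        if (!out.empty()) out += ' ';
        out += word;
    }
    return out;
}
```

Calling hexdump_words with the bytes E6 97 A5 E6 9C AC E8 AA 9E gives "97e6 e6a5 ac9c aae8 009e", which matches both dumps above.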
If you set your PowerShell to UTF-8 with [System.Text.Encoding]::UTF8:
PS C:\t> [System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
PS C:\t> .\output.exe
日本語 --> the English 1252 locale is set within the program; notice that the output is similar to the one above (but the hexdump is different)
PS C:\t> .\output.exe | hexdump
0000000 bbef 3fbf 3f3f 0a0d -> again hex dump is same for both so they are producing the same output!
0000008
PS C:\t> .\output_original.exe
日本語 --> correct output due to the fact you have forced the PowerShell encoding to UTF8, thus removing the output dependence on the OEM code (windows)
PS C:\t> .\output_original.exe | hexdump
0000000 bbef 3fbf 3f3f 0a0d -> again hex dump is same for both so they are producing the same output!
0000008
What happens here? If you force the locale in your C++ application, the std::cout output is formatted with that locale (1252), and those characters are then converted to UTF-8 (which is why the first and second examples differ slightly). When you do not force the locale in your C++ application, the PowerShell encoding is used, which is now UTF-8, and you get the correct output.
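The two-step conversion described above (bytes read as code page 1252, then re-encoded as UTF-8) can be modelled with a small sketch. This is an illustration of the idea only, not the C runtime's or the console's actual implementation. The names kCp1252High, append_utf8, and cp1252_to_utf8 are made up for this example, and the slots that Windows-1252 leaves unassigned are simply passed through with their byte value.

```cpp
#include <cstdint>
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F. All other bytes map to the
// code point with the same value. Unassigned slots (0x81, 0x8D, 0x8F, 0x90,
// 0x9D) keep their byte value here, which is an assumption of this sketch.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Append one code point below U+10000 to a UTF-8 encoded string.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Treat every input byte as one Windows-1252 character and re-encode the
// resulting text as UTF-8.
std::string cp1252_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        append_utf8(out, cp);
    }
    return out;
}
```

Feeding the nine UTF-8 bytes of "日本語" through cp1252_to_utf8 yields the UTF-8 text "æ—¥æœ¬èªž", the same garbage the question reports when the 1252 locale is forced.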
One thing I found interesting: if you change your Windows system locale to a Chinese-compatible one (PRC, Macao, Taiwan, Hong Kong, etc.), you will get some Chinese characters when not forcing UTF-8, but different ones. That means those bytes are only meaningful as Unicode (UTF-8), so that is the only setting where it works. If you force UTF-8 in PowerShell, it works correctly even with the Chinese Windows system locale.
I hope this answers your question to a greater extent.
Rant: It took me so long to investigate because my VS 2019 Community Edition license expired (WTF MS?) and I could not register it because the registration window was completely blank. Thanks MS, but no thanks.