Tags: c++, powershell, stdout, cout, setlocale

Why does LC_ALL setlocale setting affect cout output in Powershell?


I'm trying to understand some behavior I'm seeing.

I have this C++ program:

// Outputter.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>


int main()
{
    // UTF-8 bytes for "日本語"
    std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
    return 0;
}

If I run the following in Powershell:

[System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
.\print_it.exe # This is the above program ^
日本語 # This is the output as displayed in Powershell

Then 日本語 is printed and displayed correctly in Powershell.

However if I add setlocale(LC_ALL, "English_United States.1252"); to the code, like this:

int main()
{
    // setlocale is declared in <clocale>, so that header must be included too
    setlocale(LC_ALL, "English_United States.1252");

    // UTF-8 bytes for "日本語"
    std::cout << (char)0xE6 << (char)0x97 << (char)0xA5 << (char)0xE6 << (char)0x9C << (char)0xAC << (char)0xE8 << (char)0xAA << (char)0x9E;
    return 0;
}

The program now prints garbage to Powershell (æ—¥æœ¬èªž to be precise, which is the code page 1252 misinterpretation of those bytes).

BUT if I pipe the output to a file and then cat the file, it looks fine:

.\print_it.exe > out.txt
cat out.txt
日本語 # It displays fine, like this, if I redirect to a file and cat the file.

Also, Git bash displays the output properly no matter what I setlocale to.

Could someone please help me understand why setlocale is affecting how the output is displayed in Powershell, even though the same bytes are being written to stdout? It seems like Powershell is somehow able to access the locale of the program and uses that to interpret output?

Powershell version is 5.1.17763.592.


Solution

  • It is all about encoding. The reason you get the correct characters with the > redirect is that > writes UTF-16LE by default, so the output of your program (with its 1252 setting) is automatically converted to UTF-16.

    Whether you can change the encoding used by the redirect depends on your PowerShell version.

    If you use Out-File with the -Encoding switch, you can choose the encoding of the destination file (again, depending on your PowerShell version).

    I recommend reading mklement0's excellent SO post on this topic here.

    Edit based on comment

    Taken from cppreference

    std::setlocale
    C++ Localizations library
    Defined in header <clocale>

    char* setlocale( int category, const char* locale);

    The setlocale function installs the specified system locale or its portion as the new C locale. The modifications remain in effect and influences the execution of all locale-sensitive C library functions until the next call to setlocale. If locale is a null pointer, setlocale queries the current C locale without modifying it.

    The bytes you send to std::cout are the same, but output written through std::cout passes through the locale-sensitive C runtime, so the locale set in the program takes precedence over your PowerShell UTF-8 settings. If you leave out the setlocale() call, std::cout obeys the shell encoding.

    If you have PowerShell 5.1 or above, > is effectively an alias for Out-File, and you can set its encoding via $PSDefaultParameterValues, like this:

    $PSDefaultParameterValues['Out-File:Encoding'] = 'UTF8'
    

    Then you would get a UTF-8 file (with a BOM, which can be annoying!) instead of the default UTF-16LE.

    Edit - adding some details as requested by OP

    PowerShell uses the OEM code page, so by default you get whatever is configured in your Windows installation. I recommend reading an excellent post on encoding on Windows. The point is that without the UTF-8 setting in PowerShell, you are on whatever code page your system has.

    output.exe sets the locale to English_United States.1252 inside the C++ program; output_original.exe does not change it:

    Here is the output without the UTF8 PowerShell setting:

    c:\t>.\output.exe
    æ-¥æo¬èªz  --> nonsense within the win1252 code page
    c:\t>.\output.exe | hexdump
    0000000 97e6 e6a5 ac9c aae8 009e --> both hex outputs are the same!
    0000009
    c:\t>.\output_original.exe
    日本語  --> nonsense, but a different one! (depends on your locale setup - mine was English)
    c:\t>.\output_original.exe | hexdump
    0000000 97e6 e6a5 ac9c aae8 009e  --> both hex outputs are the same!
    0000009
    

    So what happens here? Your program produces output that is displayed according to either the locale set in the program itself or the one set in Windows (OEM code page 1252 on my virtual machine). Notice that in both versions the hexdump is the same, but the displayed output (after decoding) is not.

    If you set PowerShell to UTF-8 with [System.Text.Encoding]::UTF8:

    PS C:\t> [System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8
    PS C:\t> .\output.exe 
    æ—¥æœ¬èªž  --> the English 1252 locale set within the program; notice that the output is similar to the one above (but the hexdump is different)
    PS C:\t> .\output.exe | hexdump
    0000000 bbef 3fbf 3f3f 0a0d  -> again hex dump is same for both so they are producing the same output!
    0000008
    PS C:\t> .\output_original.exe
    日本語 --> correct output due to the fact you have forced the PowerShell encoding to UTF8, thus removing the output dependence on the OEM code (windows)
    PS C:\t> .\output_original.exe | hexdump
    0000000 bbef 3fbf 3f3f 0a0d -> again hex dump is same for both so they are producing the same output!
    0000008
    

    What happens here? If you force the locale in your C++ application, std::cout output is formatted with that locale (1252), and those characters are then transformed into UTF-8 (that is why the first and second examples differ a little). When you do not force the locale in your C++ application, the PowerShell encoding is used, which is now UTF-8, and you get correct output.

    One thing I found interesting: if you change your Windows system locale to a Chinese-compatible one (PRC, Macao, Taiwan, Hong Kong, etc.), you will get some Chinese characters when not forcing UTF-8, but different ones. That means those bytes only make sense as Unicode (UTF-8), so that is the only setting in which they display correctly. If you force UTF-8 in PowerShell, it works correctly even with a Chinese Windows system locale.

    I hope this answers your question to a greater extent.

    Rant: It took me so long to investigate because my VS 2019 Community edition had expired (WTF, MS?) and I could not register it because the registration window was completely blank. Thanks MS, but no thanks.