Tags: python, windows, powershell, encoding, utf-8

How do I get UTF-8 to work flawlessly in modern PowerShell on Windows?


I have a C++ program that outputs raw UTF-8. It works flawlessly on Linux, but in Windows shells the output is garbled: for example, "®" turns into "┬«" and "©" turns into "┬⌐". The code also has a Python part, which seems to fare better when printing to the shell, so I tested Python's output a bit.

PS C:\Users\user> python -c 'print("\N{GREEK CAPITAL LETTER DELTA}")' > test_file_python.txt
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0394' in position 0: character maps to <undefined>
PS C:\Users\user> python -X utf8 -c 'print("\N{GREEK CAPITAL LETTER DELTA}")' > test_file_python.txt
PS C:\Users\user> cat test_file_python.txt
Δ
PS C:\Users\user> python -c 'print("\N{GREEK CAPITAL LETTER DELTA}")'
Δ
PS C:\Users\user> cat .\test_file_python_wsl.txt  # Generated in WSL with the above commands
Δ
PS C:\Users\user> Format-Hex .\test_file_python.txt

   Label: C:\Users\user\test_file_python.txt

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 E2 95 AC C3 B6 0D 0A                            �ö��

PS C:\Users\user> Format-Hex .\test_file_python_wsl.txt

   Label: C:\Users\user\test_file_python_wsl.txt

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 CE 94 0A                                        ��

I do not understand how PowerShell handles encodings, how Python can get this right when writing to the shell but not when redirecting, or why something that works perfectly in Bash under WSL has this sort of issue in the newer, cross-platform PowerShell Core, which should "just work". These are multiple questions, but they probably have a common answer.

EDIT: I forgot to add some important information: I am using PowerShell Core v7.3.6 with these encoding settings:

PS C:\Users\user> $OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001

Solution

  • On Windows, there are two pieces to the puzzle:

    • You must instruct PowerShell to use UTF-8 when communicating with external programs.

      • Use the following magic incantation (note that chcp 65001, which is what you'd do from cmd.exe, is not an option, because .NET caches the encodings stored in [Console]):

         $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
        
      • See this answer for background information.

      • Update: In PowerShell v7.4+, using >, the redirection operator, to capture an external program's (such as python's) output in a file now saves the raw bytes, so the above is no longer strictly necessary for that case (you still need it if you want to pipe the output to PowerShell commands) - see this answer for background information.

    • You must instruct Python to use UTF-8 I/O too (assumes Python v3.7+):

      • Either: Pass -X utf8 (case matters) to the python executable:

        python -X utf8 -c 'print("\N{GREEK CAPITAL LETTER DELTA}")' > test_file_python.txt
        
      • Or: Before calling Python, run $env:PYTHONUTF8=1

      • The above enables Python UTF-8 Mode, which will become the default in Python 3.15.
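
If neither option is convenient, a script can also opt in from the inside. The following is only a minimal sketch and not part of the steps above: `sys.flags.utf8_mode` reports whether UTF-8 Mode is active, and `reconfigure()` (available on Python's text streams since v3.7) switches stdout and stderr to UTF-8 for the current process. The helper names `utf8_mode_enabled` and `force_utf8_stdio` are made up for illustration.

```python
import sys


def utf8_mode_enabled() -> bool:
    """Return True if Python was started in UTF-8 Mode (-X utf8 or PYTHONUTF8=1)."""
    return bool(sys.flags.utf8_mode)


def force_utf8_stdio() -> None:
    """Switch stdout/stderr to UTF-8 for this process, where supported."""
    for stream in (sys.stdout, sys.stderr):
        # Replaced streams (e.g. inside some test runners) may lack reconfigure().
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8")
```

Calling `force_utf8_stdio()` before the first `print` only changes how Python encodes its own output. PowerShell still has to be told to decode UTF-8, so the first piece of the puzzle remains necessary.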


    An alternative via a one-time configuration step is to switch your machine to use UTF-8 system-wide, in which case the above steps aren't necessary; however, this has far-reaching consequences and can break legacy scripts and applications - see this answer.


    Background information:

    PowerShell is partly a good Windows console citizen:

    • It uses the encoding implied by the console window's active code pages (there's one for input, and one for output), which default to the system's legacy OEM code page; specifically:

      • When decoding output from an external program, it uses the console's output code page, as reflected in .NET in [Console]::OutputEncoding, which is what external programs are at least historically expected to use when encoding their output.

      • When encoding input to provide to external programs via the pipeline (the target program's stdin stream), however, PowerShell makes a strange choice: instead of using the console's active (input) code page, it uses the encoding stored in the $OutputEncoding preference variable, which has unexpected defaults:

        • In Windows PowerShell (the legacy, Windows-only, ships-with-Windows edition whose latest and last version is v5.1), it defaults to ASCII(!)

        • In PowerShell (Core) 7+ (the modern, cross-platform, install-on-demand edition) it defaults to UTF-8(!).

          • Note: PowerShell 7+ internally uses (BOM-less) UTF-8 consistently when reading files, including source code, and writing to files, but - of necessity - decoding output from external programs must still be based on the console's (output) code page.

          • GitHub issue #7233 suggests making at least interactive PowerShell sessions also default to UTF-8 with respect to external programs, by setting the console code pages to 65001.
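
The symptoms in the question can be reproduced outside PowerShell. The sketch below assumes that the console's OEM code page is 437 (typical for US-English systems; this is inferred from the garbled characters shown in the question, not stated there). It encodes text as UTF-8, as the external program does, and then decodes the bytes the way PowerShell would by default. `misdecode` is a hypothetical helper name.

```python
def misdecode(text: str, program_encoding: str = "utf-8",
              console_encoding: str = "cp437") -> str:
    """Encode text as the external program would, then decode it as the shell would."""
    return text.encode(program_encoding).decode(console_encoding)


# The C++ program's UTF-8 output, decoded via the (assumed) OEM code page 437:
registered = misdecode("\N{REGISTERED SIGN}")   # the two characters U+252C U+00AB
copyright_ = misdecode("\N{COPYRIGHT SIGN}")    # the two characters U+252C U+2310

# Python's UTF-8 bytes for the Greek delta (CE 94), misdecoded the same way and
# then written to a file as UTF-8, which is what PowerShell 7 does with ">":
delta_bytes = misdecode("\N{GREEK CAPITAL LETTER DELTA}").encode("utf-8")
print(delta_bytes.hex(" ").upper())  # E2 95 AC C3 B6
```

The five bytes printed match the Format-Hex output of test_file_python.txt in the question (ignoring the trailing CRLF), which suggests that the file content was decoded with the OEM code page before being saved.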

    Python exhibits nonstandard behavior:

    • When it finds its stdout stream redirected, it uses the system's legacy ANSI(!) code page for encoding its output by default.

    • When printing directly to the console, the misinterpretation problems that affect captured or redirected output do not surface, because Python then uses the relevant Unicode WinAPI to print to the console, which bypasses code pages altogether:

      • In other words: Python's output always displays correctly when printed directly to the console, but misinterpretation can occur when redirecting output to a file, passing it on through PowerShell's pipeline, or capturing it in a PowerShell variable.
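
The traceback in the question follows from the first point. As a rough illustration, the sketch below assumes that the ANSI code page is Windows-1252, as the cp1252.py frame in the traceback indicates. It checks whether a character can be encoded at all and offers a small diagnostic for the current stdout. `can_encode` and `describe_stdout` are hypothetical helper names.

```python
import sys


def can_encode(text: str, encoding: str) -> bool:
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def describe_stdout() -> str:
    """Summarize whether stdout is a console and which encoding it uses."""
    encoding = getattr(sys.stdout, "encoding", None)
    return f"isatty={sys.stdout.isatty()}, encoding={encoding}"


delta = "\N{GREEK CAPITAL LETTER DELTA}"

# Windows-1252 has no Greek delta, hence the UnicodeEncodeError once stdout
# is redirected and Python falls back to the ANSI code page.
cp1252_ok = can_encode(delta, "cp1252")  # False
utf8_ok = can_encode(delta, "utf-8")     # True
```

Printing `describe_stdout()` to stderr, once run directly and once with stdout redirected via >, should show the encoding change between the two cases unless UTF-8 Mode is enabled.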