python powershell character-encoding subprocess

Using Python subprocess to run PowerShell causes encoding errors in stdout

I'm trying to run a PowerShell script from Python and print its output, but the output contains special characters such as "é".

process = subprocess.Popen([r'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe',  'echo é'], stdout=subprocess.PIPE)
print(process.stdout.read().decode('cp1252'))

returns ","

process = subprocess.run(r'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe echo é', stdout=subprocess.PIPE)
print(process.stdout.decode('cp1252'))

returns ","

print(subprocess.check_output(r'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe echo é').decode('cp1252'))

returns ","

Is there an alternate method other than subprocess, or maybe a different encoding I should be using?

Decoding as UTF-8 raises an error for é but returns "r" for ®. Decoding as UTF-16-LE raises "UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 2: truncated data".

Solution

powershell.exe, the Windows PowerShell CLI,[1] encodes its stdout and stderr output using the active console window's code page, as reported by chcp. By default, that is the OEM code page of the legacy system locale, e.g. (expressed in Python terms) cp437.

By contrast, the code page you used - cp1252 - is an ANSI code page, so the bytes that PowerShell emitted were decoded with the wrong character mapping.

Note: When its stdout and stderr are redirected or piped, Python by default encodes its own output using the system's ANSI code page. That is nonstandard behavior: console applications are expected to use the current console's output code page, which is what powershell.exe does and which, as stated, defaults to the system's OEM code page.
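
The mismatch can be reproduced in pure Python, without starting PowerShell. The sketch below assumes that the console's OEM code page is cp437 (yours may differ, e.g. cp850) and that PowerShell terminates the echoed line with CRLF. It shows why decoding as cp1252 yields "‚" (U+201A, which resembles the comma reported in the question) and why the UTF-8 and UTF-16-LE attempts fail:

```python
# Assumption: the console's OEM code page is cp437 and PowerShell
# terminates the echoed line with CRLF.
raw = "é".encode("cp437") + b"\r\n"   # the bytes 0x82 0x0D 0x0A

# Decoding with the ANSI code page maps 0x82 to U+201A, not to é.
wrong = raw.decode("cp1252").strip()
# Decoding with the OEM code page round-trips correctly.
right = raw.decode("cp437").strip()
print(ascii(wrong), ascii(right))


def decode_error(data, encoding):
    """Return the UnicodeDecodeError message, or None if decoding succeeds."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# 0x82 is not a valid UTF-8 start byte.
print(decode_error(raw, "utf-8"))
# Three bytes cannot form whole UTF-16 code units: the last byte is left over.
print(decode_error(raw, "utf-16-le"))
```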

One option is to simply query the console window for its active (output) code page via the WinAPI and use the encoding returned:

import subprocess
from ctypes import windll

# Get the console's (output) code page, which the PowerShell CLI
# uses to encode its output.
cp = windll.kernel32.GetConsoleOutputCP()

process = subprocess.Popen(r'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe echo é', stdout=subprocess.PIPE)

# Decode based on the active code page.
print(process.stdout.read().decode('cp' + str(cp)))
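
As a variation, subprocess.run() can do the decoding itself via its encoding parameter (Python 3.6+). The following is a minimal sketch rather than part of the approach above: run_and_decode is a hypothetical helper name, and the demo uses a child Python process that emits the cp437 byte for é as a stand-in for powershell.exe, so that it runs on any platform:

```python
import subprocess
import sys


def run_and_decode(args, encoding):
    """Run a command and return its stdout decoded with the given encoding."""
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        encoding=encoding,   # subprocess decodes the captured bytes for us
        errors="replace",    # undecodable bytes become U+FFFD instead of raising
    )
    return result.stdout


# Stand-in child process: writes the single byte 0x82, which is é in cp437.
child = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'\\x82')"]
text = run_and_decode(child, "cp437")
print(ascii(text))
```

On Windows, the same helper could be called with the PowerShell command line as args and 'cp' + str(windll.kernel32.GetConsoleOutputCP()) as the encoding.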

However, note that the OEM code page limits you to 256 characters; while é can be represented in CP437, for instance, other Unicode characters, such as €, cannot.
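
To check ahead of time whether a given character survives a particular code page, you can simply try to encode it. A small sketch, in which fits_code_page is a hypothetical helper name and cp437 is assumed to be the OEM code page:

```python
def fits_code_page(text, code_page):
    """Return True if every character in text can be encoded in code_page."""
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


# Compare the assumed OEM code page (cp437) with UTF-8 for each character.
for ch in "é€":
    print(ascii(ch), fits_code_page(ch, "cp437"), fits_code_page(ch, "utf-8"))
```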

Therefore the robust option is to (temporarily) set the console output code page to 65001, which is UTF-8:

import subprocess
from ctypes import windll

# Save the current console output code page and switch to 65001 (UTF-8).
previousCp = windll.kernel32.GetConsoleOutputCP()
windll.kernel32.SetConsoleOutputCP(65001)

try:
    process = subprocess.Popen(r'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe echo é€', stdout=subprocess.PIPE)

    # Decode as UTF-8.
    print(process.stdout.read().decode('utf8'))
finally:
    # Restore the previous console output code page, even if an error occurred.
    windll.kernel32.SetConsoleOutputCP(previousCp)
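
If you need this in several places, the save / switch / restore steps can be packaged as a context manager. This is only a sketch: console_output_cp is a hypothetical name, and the non-Windows branch (which does nothing) is an assumption added so that the code can be imported on any platform:

```python
import contextlib
import sys


@contextlib.contextmanager
def console_output_cp(code_page):
    """Temporarily switch the console output code page (Windows only).

    Yields the previous code page, or None on non-Windows platforms.
    """
    if sys.platform != "win32":
        # Assumption: outside Windows there is no console code page to switch.
        yield None
        return

    from ctypes import windll

    kernel32 = windll.kernel32
    previous = kernel32.GetConsoleOutputCP()
    kernel32.SetConsoleOutputCP(code_page)
    try:
        yield previous
    finally:
        # Restore the previous code page even if the body raised.
        kernel32.SetConsoleOutputCP(previous)


# Usage: run the child process inside the block and decode its output as UTF-8.
with console_output_cp(65001) as previous_cp:
    print("previous code page:", previous_cp)
```

Note that GetConsoleOutputCP() returns 0 when the process has no console attached, so the previous value may not always be meaningful.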

Note:

The above only ensures that the PowerShell child process emits UTF-8 and that its output is decoded as such inside the Python process, which is unrelated to what character encoding Python itself uses for its output streams.
To put Python v3.7+ itself in Python UTF-8 Mode, which makes it decode input as UTF-8 and produce UTF-8 output, pass command-line option -X utf8 or define environment variable PYTHONUTF8 with a value of 1 before invocation.
To additionally make an interactive shell session use UTF-8 (use the 65001 code page) for the remainder of the session:
- In a cmd.exe session:
  - chcp 65001
- In a PowerShell session:
  - $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
- A simpler alternative via a one-time configuration step is to configure your system to use UTF-8 system-wide, in which case both the OEM and the ANSI code pages are set to 65001. However, this has far-reaching consequences - see this answer.
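
The effect of -X utf8 and PYTHONUTF8, as mentioned in the note above, can be verified by starting a child Python process and inspecting sys.flags.utf8_mode (Python 3.7+). A minimal sketch, in which utf8_mode_report is a hypothetical helper name:

```python
import os
import subprocess
import sys

# Child code: report whether UTF-8 Mode is on and which encoding stdout uses.
probe = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding)"


def utf8_mode_report(extra_args=(), extra_env=None):
    """Run the probe in a child Python and return [utf8_mode, stdout_encoding]."""
    env = dict(os.environ)
    # Start from a known state so inherited settings do not skew the result.
    env.pop("PYTHONUTF8", None)
    env.pop("PYTHONIOENCODING", None)
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", probe],
        stdout=subprocess.PIPE,
        env=env,
        encoding="ascii",
    )
    return result.stdout.split()


via_option = utf8_mode_report(extra_args=["-X", "utf8"])
via_env = utf8_mode_report(extra_env={"PYTHONUTF8": "1"})
print(via_option, via_env)
```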

[1] The same applies to pwsh.exe, the CLI of the modern PowerShell (Core) 7+ edition.