Here's the situation: when I run my script.py file in any PowerShell console on my computer:

python '/path/to/my/script.py'

the output I get is "Cédric". But when I run the script through UiPath (where the output is saved as a string), the output I get is "CÚdric". I understand that the issue is somehow related to the encoding.
After some research, I found out that running the PowerShell command `[System.Text.Encoding]::Default.EncodingName` gives me different results in a regular PowerShell console and in UiPath (where it reports UTF-8).
I found out that the hex value of "é" is E9 in the Windows-1252 encoding, but in the CP850 encoding E9 is "Ú". So I guess this is the encoding relation I'm looking for. However, I have tried many things in UiPath (C#) and in PowerShell commands, and nothing resolved my problem (I tried both changing encoding values and converting the string into bytes to change the output encoding).
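For instance, reversing the two encodings on the mangled string recovers the original (a post-hoc demonstration of the relation, not a fix for the pipeline itself):

```powershell
# Re-encode the mangled string with CP850 to recover the original bytes,
# then decode those bytes as Windows-1252:
$bytes = [System.Text.Encoding]::GetEncoding(850).GetBytes('CÚdric')
[System.Text.Encoding]::GetEncoding(1252).GetString($bytes)  # -> Cédric
```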
And to anticipate some questions:
TL;DR: Basically, the issue arises when UiPath interprets the output of the PowerShell console running the Python script.
I've been stuck on this for 3 days now, only to gain 2% more precision on the project I'm working on (which is completely fine other than that); it's not worth the time I'm spending on it, but I need to know.
As for `[System.Text.Encoding]::Default`: that you're seeing UTF-8 as the value in UiPath implies that it is using PowerShell (Core) 7+ (`pwsh.exe`), the modern, install-on-demand, cross-platform edition built on .NET 5+, whereas Windows PowerShell (`powershell.exe`), the legacy, ships-with-Windows, Windows-only edition, is built on .NET Framework.
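A quick way to verify which edition each environment runs (a diagnostic sketch; execute it both in your regular console and from UiPath):

```powershell
$PSVersionTable.PSEdition                     # 'Core' for PowerShell 7+, 'Desktop' for Windows PowerShell
[System.Text.Encoding]::Default.EncodingName  # 'Unicode (UTF-8)' on .NET 5+; the ANSI encoding on .NET Framework
```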
PowerShell honors the system's active legacy OEM code page by default when interpreting output from external programs (such as Python scripts),[1] e.g. 850, as reported by `chcp` and as reflected in `[Console]::OutputEncoding` from inside PowerShell.
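To inspect what is in effect on a given machine (a diagnostic sketch):

```powershell
chcp                       # reports the active OEM code page, e.g. 850
[Console]::OutputEncoding  # the encoding PowerShell uses to DECODE external-program output
$OutputEncoding            # the encoding PowerShell uses to SEND data to external programs
```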
That is, PowerShell interprets the byte stream received from external programs as text encoded according to `[Console]::OutputEncoding` and decodes it that way, resulting in a Unicode in-memory string representation, given that PowerShell is built on .NET, whose strings are composed of UTF-16 Unicode code units (`[char]`). If `[Console]::OutputEncoding` doesn't match the encoding that the external program actually uses, misinterpreted text can be the result, as in your case.[2]
`python script.py` results in `Cédric` printing to the console, but `python script.py | Write-Output` - due to the use of a pipeline - involves interpretation by PowerShell, and the encoding mismatch results in `CÚdric`.
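You can reproduce the mismatch without UiPath (a sketch, assuming Windows defaults: a CP850 console and an ANSI code page of Windows-1252; the `\u00e9` escape avoids command-line encoding issues):

```powershell
python -c "print('C\u00e9dric')"                 # direct to console: Cédric (Unicode console API)
python -c "print('C\u00e9dric')" | Write-Output  # piped: 0xE9 emitted as Windows-1252, decoded as CP850 -> CÚdric
```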
A UTF-8 opt-in is available:
Execute the following in PowerShell, before calling the Python script (see this answer for background information):
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
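If you prefer not to leave the session-wide settings changed, here is a sketch of applying the opt-in only for the duration of the call (the script path is your own):

```powershell
# Save the current encodings, switch to UTF-8, run the script, then restore.
$prevOutputEncoding = $OutputEncoding
$prevConsoleIn      = [Console]::InputEncoding
$prevConsoleOut     = [Console]::OutputEncoding
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
try {
  # Pair this with the Python-side UTF-8 opt-in described below,
  # so that Python actually emits UTF-8.
  $result = python '/path/to/my/script.py'
}
finally {
  $OutputEncoding           = $prevOutputEncoding
  [Console]::InputEncoding  = $prevConsoleIn
  [Console]::OutputEncoding = $prevConsoleOut
}
```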
Python, by contrast, defaults to the system's active legacy ANSI code page (e.g. Windows-1252).[3]
A UTF-8 opt-in is available, either:

- By defining the environment variable `PYTHONUTF8` with the value `1`: before calling your Python script, execute `$env:PYTHONUTF8=1` in PowerShell.
- Or, in Python 3.7+, with explicit `python` CLI calls, by using the `-X utf8` option (case matters); see the sketch after this list.
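Both opt-ins, side by side (a sketch; the script path is your own):

```powershell
# Option A: environment variable - affects all python calls in this session.
$env:PYTHONUTF8 = 1
python '/path/to/my/script.py'

# Option B: per-call CLI option (Python 3.7+) - note the lowercase 'utf8'.
python -X utf8 '/path/to/my/script.py'
```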
Note:
Given the above - assuming that your Python script only ever outputs characters that are part of the Windows-1252 code page - the alternative is to leave Python at its defaults and (temporarily) set the console encoding to Windows-1252 instead of UTF-8:
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1252)
There is an option to NOT require this configuration, namely configuring Windows to use UTF-8 system-wide, as described in this answer, which sets both the active OEM and the active ANSI code page to 65001, i.e. UTF-8.
Caveat: This feature - still in beta as of Windows 11 22H2 - has far-reaching consequences:
It causes preexisting, BOM-less files encoded based on the culture-specific ANSI code page (e.g. Windows-1252) to be misinterpreted by default by Windows PowerShell, Python, and generally all non-Unicode Windows applications.
Note that .NET applications, including PowerShell (Core) 7+ (but not Windows PowerShell),[1] have the inverse problem, which they must deal with irrespective of this setting: because they assume that a BOM-less file is UTF-8-encoded, they must specify the culture-specific legacy ANSI code page explicitly when reading such files.
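For example, in PowerShell (Core) 7+, reading such a file correctly requires naming the code page explicitly (a sketch; `legacy.txt` is a stand-in for your own ANSI-encoded file):

```powershell
# Get-Content in PowerShell 7+ assumes UTF-8 for BOM-less files, so pass the encoding:
Get-Content -Path 'legacy.txt' -Encoding ([System.Text.Encoding]::GetEncoding(1252))

# Equivalent direct .NET call:
[System.IO.File]::ReadAllText('legacy.txt', [System.Text.Encoding]::GetEncoding(1252))
```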
[1] PowerShell-native commands and scripts, which run in-process, consistently communicate text via in-memory Unicode strings, due to using .NET strings, so no encoding problems can arise.
When it comes to reading files, Windows PowerShell defaults to the ANSI code page when reading source code and text files with `Get-Content`, whereas PowerShell (Core) 7+ now - commendably - consistently defaults to UTF-8, also with respect to what encoding is used to write files - see this answer for more information.
[2] Specifically, Python outputs byte `0xE9`, meaning it to be character `é`, due to using Windows-1252 encoding. PowerShell misinterprets this byte as referring to character `Ú`, because it decodes the byte as CP850, as reflected in `[Console]::OutputEncoding`. Compare `[Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xE9)` (-> `é`, whose Unicode code point is `0xE9` too, because Unicode is mostly a superset of Windows-1252) to `[Text.Encoding]::GetEncoding(850).GetString([byte[]] 0xE9)` (-> `Ú`, whose Unicode code point is `0xDA`).
[3] This applies when its stdout / stderr streams are connected to something other than a console, such as when their output is captured by PowerShell.
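You can observe this from PowerShell (a diagnostic sketch; exact values depend on your Python version and locale):

```powershell
# Attached to the console, Python 3.6+ uses the Windows Unicode console API:
python -c "import sys; print(sys.stdout.encoding)"                  # typically utf-8

# Captured via a pipeline, stdout is redirected and the ANSI code page applies:
python -c "import sys; print(sys.stdout.encoding)" | Write-Output   # e.g. cp1252
```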