Non-ASCII characters in CLI output in a PowerShell script


How can I retrieve users with non-Latin names from this output?

.\Pacli.exe USERSLIST INCLUDESUBLOCATIONS=YES output`(name`, enclose`, type`) > users.txt

The non-Latin characters are saved as ?, and they still show up as ? when the file is read back, even with Get-Content -Encoding UTF8.

I tried to set

$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding

before this command but got the same result.


Solution

  • tl;dr

    Use the code at the bottom to temporarily change [Console]::OutputEncoding to match PACLI.exe's nonstandard output encoding, which appears to be ANSI, to ensure that its output is decoded correctly.


    Per your own feedback, it turns out that PACLI.exe exhibits nonstandard behavior and outputs Windows-1252-encoded text.

    • Note that the specific code page used on a given system may be driven more abstractly by the legacy ANSI code page associated with that system's legacy system locale (aka language for non-Unicode programs). This is the - also nonstandard - behavior that Python exhibits, for instance.

      • E.g., on a US English machine the ANSI code page would be 1252 (Windows-1252), but on a Russian machine it would be 1251 (Windows-1251).
    • The solution below assumes that PACLI.exe too exhibits this ANSI-code-page-dependent behavior, so it uses the following to retrieve the current machine's ANSI code page, whatever it may be; if you know that PACLI.exe hard-codes use of 1252, specifically, replace the expression with verbatim 1252:

      [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
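
To illustrate why the decoding encoding must match the encoding a program actually emits, here is a minimal sketch that involves no external program at all. It encodes a sample string as Windows-1252 bytes and then decodes those bytes once with CP437 (an OEM code page) and once with 1252. The sample string and the two code-page numbers are illustrative assumptions, not values taken from PACLI.exe.

```powershell
# Sample text with one non-ASCII character; [char] 0xEB avoids depending on
# the encoding of the script file itself.
$text = 'Zo' + [char] 0xEB

# Example code pages only (assumption): 1252 as the ANSI side, 437 as the OEM side.
$ansi = [Text.Encoding]::GetEncoding(1252)
$oem  = [Text.Encoding]::GetEncoding(437)

# The bytes a 1252-emitting program would write for this text.
$bytes = $ansi.GetBytes($text)

# Decoding with the wrong code page yields a different character ...
$misdecoded = $oem.GetString($bytes)

# ... whereas decoding with the matching code page round-trips the text.
$roundTripped = $ansi.GetString($bytes)

"mis-decoded: $misdecoded"
"round trip:  $roundTripped"
```

This is the same mismatch that occurs when [Console]::OutputEncoding reflects the OEM code page while the program writes ANSI bytes.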
      

    Capturing output from external programs that use nonstandard character encodings:

    • It is a general requirement that the character encoding stored in [Console]::OutputEncoding must match the actual encoding used in the output from an external program, because PowerShell uses the former to decode the latter.

      • Note:

        • This applies whenever PowerShell captures the output, such as in a variable or as part of an expression, or by relaying it to another command via the pipeline.

        • There is a notable exception with respect to >, the redirection operator in PowerShell (Core) 7.4+: when applied to an external program, the raw output bytes are passed through to the target file. To instead capture the output and save it with a different encoding, use the pipeline and Set-Content.

      • The encoding used for sending data to an external program via the pipeline is stored in the $OutputEncoding preference variable, which - unfortunately - defaults to ASCII(!) in Windows PowerShell, and to UTF-8 in PowerShell (Core) 7 (which is preferable to ASCII, but inconsistent with the [Console]::OutputEncoding value - see GitHub issue #7233).

        • Again, there is an exception in PowerShell 7.4+: if the pipeline input is also provided by an external program, the raw bytes are passed through.
    • Standard behavior of console applications would be to respect the current console's output code page, as reflected in the encoding stored in [Console]::OutputEncoding, in which case no extra effort is needed to properly decode and capture output.

      • Console windows / Windows Terminal tabs default to the legacy OEM code page associated with a given machine's system locale, such as CP437 on US-English systems.
    • It is the limitations of the single-byte[1] OEM code pages - which limit what you can output to 256 characters - that increasingly cause modern CLIs to output UTF-8, as it is capable of encoding all Unicode characters. node.exe, the Node.js CLI, is one example. Others allow UTF-8 opt-in, via command-line options or environment variables.

    • The PACLI.exe and Python behavior of choosing the ANSI code page for their nonstandard encoding is unfortunate, because ANSI code pages are single-byte[1] too, and therefore don't solve the problem of limited character repertoire.

    • There is a system-wide solution that makes most programs behave properly without extra effort; however, it has far-reaching consequences and can change the behavior of existing scripts in undesired ways.

      • Assuming you have administrative privileges, you can set the legacy system locale to UTF-8, which sets both the OEM and the ANSI code page to 65001, the UTF-8 code page. For details and a discussion of the far-reaching consequences, see this answer.

      • Note that this solution won't help with Windows CLIs such as sfc.exe and wsl.exe, which (situationally) output UTF-16LE; unless such CLIs offer UTF-8 opt-in (e.g. WSL's $env:WSL_UTF8=1), you still need to temporarily modify [Console]::OutputEncoding, as shown below.

    • Otherwise, you'll need to temporarily change [Console]::OutputEncoding to match a nonstandard CLI's output encoding, as shown below. That is, save the current value of [Console]::OutputEncoding before changing it, and restore it afterwards, to avoid affecting subsequent calls to (standard) external (console) applications. Because changing [Console]::OutputEncoding affects the console window itself, an unrestored change stays in effect for the remainder of the session.

    Capturing output from a CLI that outputs ANSI, like (presumably) PACLI.exe and Python, using your specific PACLI.exe call:

    & {
      # Temporarily change the expected output encoding to ANSI.
      $prevEnc = [Console]::OutputEncoding
      [Console]::OutputEncoding = 
        [Text.Encoding]::GetEncoding(
          [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
        )
    
      try {
    
        .\Pacli.exe USERSLIST INCLUDESUBLOCATIONS=YES output`(name`, enclose`, type`) |
          Set-Content -Encoding utf8 users.txt
    
      } finally {
    
        # Restore the original encoding.
        [Console]::OutputEncoding = $prevEnc
    
      }
    }
    
    • For UTF-8, use [Console]::OutputEncoding = [Text.UTF8Encoding]::new()

    • Note that > users.txt was deliberately replaced with | Set-Content -Encoding utf8 users.txt, to predictably generate a UTF-8 output file, in both PowerShell editions - although in Windows PowerShell the file will have a BOM.[2]

      • That is, the use of the pipeline with a file-saving command ensures that decoding into .NET strings of the external-program output takes place first, with the file-saving command then using its default encoding or the encoding specified via -Encoding. This in effect allows you to transcode the output; in the case at hand, ANSI output turns into UTF-8 output.

      • In Windows PowerShell and PowerShell 7 up to 7.3.x, the > operator exhibits this behavior too: it is in effect an alias of piping to Out-File with the latter's default encoding, which is UTF-16LE ("Unicode") in Windows PowerShell and (BOM-less) UTF-8 in PowerShell 7 up to 7.3.x.

      • As noted above, with respect to external programs, > in PowerShell 7.4+ behaves differently and captures the raw byte output in the target file; that is, with > users.txt the above would create an ANSI file.
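
If the save-and-restore pattern is needed for more than one call, it can be wrapped in a small function. The following is only a sketch: Invoke-WithConsoleEncoding is a hypothetical helper name, not a built-in cmdlet, and it assumes the session is attached to a console so that [Console]::OutputEncoding can be set.

```powershell
# Hypothetical helper (not part of PowerShell or PACLI): runs a script block
# while [Console]::OutputEncoding is temporarily set to the given encoding,
# then restores the previous value even if the script block fails.
function Invoke-WithConsoleEncoding {
  param(
    [Parameter(Mandatory)] [System.Text.Encoding] $Encoding,
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock
  )

  $prevEnc = [Console]::OutputEncoding
  [Console]::OutputEncoding = $Encoding
  try {
    # Output of the script block is passed on to the caller's pipeline.
    & $ScriptBlock
  } finally {
    # Restore the original encoding.
    [Console]::OutputEncoding = $prevEnc
  }
}
```

Under these assumptions, the PACLI.exe call above could then be written as Invoke-WithConsoleEncoding -Encoding $ansiEncoding -ScriptBlock { ... } | Set-Content -Encoding utf8 users.txt, where $ansiEncoding stands for the encoding object obtained via [Text.Encoding]::GetEncoding() as in the original code.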


    [1] Except in CJK system locales.

    [2] Unfortunately, workarounds are required to create BOM-less UTF-8 files in Windows PowerShell (which PowerShell 7 creates by default) - see this answer.
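
One commonly used workaround of this kind is to bypass the file-writing cmdlets and call .NET directly, passing a UTF-8 encoding object constructed without a BOM. The sketch below assumes the output has already been captured as an array of strings; $lines and the temporary file path are placeholders for illustration, not values from the original question.

```powershell
# Placeholder data standing in for captured, already-decoded output lines.
$lines = 'name1', ('Zo' + [char] 0xEB)

# .NET methods do not follow PowerShell's current location, so use a full path.
# The file name is a hypothetical example.
$path = Join-Path ([IO.Path]::GetTempPath()) 'users-nobom.txt'

# $false = do not emit a byte-order mark (BOM).
$utf8NoBom = [Text.UTF8Encoding]::new($false)

# The [string[]] cast selects the string-array overload unambiguously.
[IO.File]::WriteAllLines($path, [string[]] $lines, $utf8NoBom)
```

The [Text.UTF8Encoding]::new() syntax requires PowerShell 5 or later; in older versions, New-Object Text.UTF8Encoding $false is the equivalent.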