powershell, svn, utf-8

svn status powershell non-ascii characters problem


I am seeing odd behavior in a PowerShell script that uses svn commands. Here is an example script:

    $svnOutput = svn status
    Write-Host "Output when saved in a variable"
    $svnOutput
    Write-Host "Direct Output"
    svn status

If I run this script in the PowerShell console, I get two different outputs when one of the files has non-ASCII characters in its name (umlauts such as ä, ö, ü in my example). This is the output:

    Output when saved in a variable
    ?    Test_���.txt
    Direct Output
    ?    Test_äöü.txt

I am working on Windows Server 2022 with VisualSVN Server version 5.4.1. I have already tried the following:

    chcp 65001
    $svnOutput = svn status
    $svnOutput

    $OutputEncoding = [System.Text.Encoding]::UTF8
    [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
    $svnOutput = svn status
    $svnOutput

    $svnOutput = & svn status | Out-String -Stream
    $svnOutput

    svn status > svn_status.txt
    $svnOutput = Get-Content -Path "svn_status.txt" -Encoding UTF8
    $svnOutput

But all of them produce the same garbled output.

PS: This also happens with other commands, such as

    $svnOutput = svn add . --force
    $svnOutput

which results in:

    A    Test_���.txt

Typing svn add . --force directly in a PowerShell instance, or even in a script, works without any issues. Hopefully someone can help me here. Thanks!


Solution

  • The SVN documentation states (emphasis added):

    The default character encoding is derived from your operating system's native locale.

    This is in the context of the --encoding parameter, which is documented as overriding the default encoding on submitting information ("your commit message"), but it seemingly (and sensibly) also applies when retrieving information.

    On Windows, the native locale is the so-called legacy system locale, aka language for non-Unicode programs, and it determines two encodings via Windows code pages: the OEM code page (which may be, e.g., CP437 or CP850), typically used by console (terminal) applications, and the ANSI code page (e.g., Windows-1252 or Windows-1251), typically used by GUI applications.

    While the SVN documentation doesn't spell out which of these two code pages the svn utility uses, per your own feedback it seems to be the system's active ANSI code page (which, as noted, is unusual, because console applications by convention use the OEM code page; python is similarly unusual).


    PowerShell consoles on Windows use the OEM code page by default, as reflected in [Console]::OutputEncoding.
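
To check which code pages are in play on a given machine, the values can be read side by side. This is a minimal, Windows-only sketch. It assumes the `OEMCP` value sits next to the `ACP` value in the same registry key, which is the case on standard Windows installations; the property names of the result object are chosen here for illustration.

```powershell
# Windows only: read the system locale's two legacy code pages from the registry.
$nls = Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage'

$codePages = [pscustomobject]@{
  AnsiCodePage    = [int] $nls.ACP                      # system ANSI code page
  OemCodePage     = [int] $nls.OEMCP                    # system OEM code page
  ConsoleCodePage = [Console]::OutputEncoding.CodePage  # what PowerShell decodes with
}

$codePages
```

If `AnsiCodePage` and `ConsoleCodePage` differ, captured output from a program that writes ANSI-encoded text is likely to be decoded incorrectly for non-ASCII characters.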

    Thus, in order for PowerShell to interpret (decode) ANSI output correctly,[1] [Console]::OutputEncoding must be (temporarily) set to the system's active ANSI code page, as follows:

    & {
      # Temporarily change the expected output encoding to the ANSI code page.
      $prevEnc = [Console]::OutputEncoding
      [Console]::OutputEncoding = 
        [Text.Encoding]::GetEncoding(
          [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
        )
    
      svn status
    
      # Restore the original encoding.
      [Console]::OutputEncoding = $prevEnc
    
    }
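
If this pattern is needed in several places, it could be wrapped in a small helper. The following is only a sketch: `Invoke-WithConsoleEncoding` is a hypothetical name, not part of PowerShell or svn, and the `try`/`finally` block is an addition so that the previous encoding is restored even if the wrapped command throws.

```powershell
function Invoke-WithConsoleEncoding {
  param(
    [Parameter(Mandatory)] [System.Text.Encoding] $Encoding,
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock
  )

  # Remember the encoding PowerShell currently uses to decode external output.
  $prevEnc = [Console]::OutputEncoding
  try {
    [Console]::OutputEncoding = $Encoding
    & $ScriptBlock
  }
  finally {
    # Runs on success and on failure alike.
    [Console]::OutputEncoding = $prevEnc
  }
}

# Usage (Windows): capture svn output decoded with the active ANSI code page.
# $ansi = [Text.Encoding]::GetEncoding(
#   [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))
# $svnOutput = Invoke-WithConsoleEncoding -Encoding $ansi -ScriptBlock { svn status }
```

Passing the encoding as a parameter keeps the helper usable for other programs that write in a different encoding.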
    

    Note that (non-CJK) ANSI encodings are fixed single-byte encodings and are therefore limited to 256 characters. If you need full Unicode support, use --encoding utf8 in your svn call and set [Console]::OutputEncoding = [Text.UTF8Encoding]::new().[2]

    See also:

    • This answer provides background information on how character encoding comes into play when PowerShell talks to external programs.

    [1] Note that decoding, i.e. converting an external program's raw byte output into .NET strings (as used by PowerShell) based on a character encoding, only comes into play when an external program's output is either captured (in a variable), relayed, or redirected. When printing directly to the display, encoding problems usually do not surface, because many CLIs use the Unicode-capable WriteConsoleW WinAPI function for that.
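
The effect of decoding with a mismatched encoding can be reproduced without svn. The sketch below is an illustration only. It uses ISO-8859-1 as a stand-in for the ANSI code page (an assumption made for portability; it encodes ä, ö, ü to the same single bytes as Windows-1252) and decodes those bytes as UTF-8 as one example of a wrong encoding.

```powershell
# Build the file name from code points so the script file's own encoding doesn't matter.
$name = 'Test_' + [char]0xE4 + [char]0xF6 + [char]0xFC + '.txt'   # Test_äöü.txt

# Bytes as a program writing with a single-byte, ANSI-style encoding would emit them.
$ansiLike = [Text.Encoding]::GetEncoding('iso-8859-1')
$bytes = $ansiLike.GetBytes($name)

# Decoding with a mismatched encoding: the umlaut bytes are not valid UTF-8.
$wrong = [Text.Encoding]::UTF8.GetString($bytes)

# Decoding with the encoding that was used for writing recovers the name.
$right = $ansiLike.GetString($bytes)

$wrong   # invalid bytes surface as U+FFFD replacement characters
$right
```

The same bytes yield either a garbled or a correct string depending solely on the encoding used to decode them, which is the situation a captured `svn status` call runs into.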

    [2] Assuming you have administrative privileges, another option is to use UTF-8 as part of your system locale, which sets both the OEM and the ANSI code page to 65001, i.e. UTF-8. However, doing so has far-reaching consequences: see this answer.