Tags: ruby, powershell, encoding, stream, file-type

Why does redirecting Ruby's output to a file in Windows PowerShell change the file's encoding?


I have a strange problem. On Windows 7, I run this command in PowerShell:

ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" > test.txt

When I read test.txt file:

ruby -E UTF-8 -e "puts gets" < test.txt

the result is:

�i0F0^0�0�0W0O0J0X�D0W0~0Y0,Mr Jason

When I inspect test.txt, I find the file's encoding is Unicode (UTF-16), not UTF-8.

What should I do? How can I ensure the encoding of a file created by redirection? Please help me.


Solution

  • tl;dr

    Unfortunately, the solution (on Windows) is much more complicated than one would hope:

    # Make PowerShell both send and receive data as UTF-8 when talking to
    # external (native) programs.
    # Note: 
    #  * In *PowerShell (Core) 7+*, $OutputEncoding *defaults* to UTF-8.
    #  * You may want to save and restore the original settings.
    $OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new()
     
    # Create a BOM-less UTF-8 file.
    # Note: In *PowerShell (Core) 7+*, you can less obscurely use:
    #   ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" | Set-Content test.txt
    $null = New-Item -Force test.txt -Value (
      ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'"
    )
    
    # Pipe the resulting file back to Ruby as UTF-8, thanks to $OutputEncoding
    # Note that PowerShell has NO "<" operator - stdin input must be provided
    # via the pipeline.
    Get-Content -Raw test.txt | ruby -E UTF-8 -e "puts gets"
    
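    If you control the Ruby side, a simpler workaround (a sketch of my own, not part of the solution above) is to have Ruby write the file itself, so that PowerShell's decode/re-encode pipeline never touches the bytes:

    ```ruby
    # Sketch: bypass PowerShell's re-encoding by writing the file from Ruby
    # itself, with an explicit UTF-8 encoding. The file name test.txt matches
    # the question; adjust as needed.
    File.write("test.txt", "どうぞよろしくお願いします,Mr Jason\n", encoding: "UTF-8")

    # Reading it back with an explicit external encoding round-trips cleanly:
    content = File.read("test.txt", encoding: "UTF-8")
    ```

    Because no shell redirection is involved, this produces a BOM-less UTF-8 file in both Windows PowerShell and PowerShell 7+.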

    • In terms of character encoding, PowerShell communicates with external (native) programs via two settings that contain .NET System.Text.Encoding instances:

      • $OutputEncoding specifies the encoding to use to send data TO an external program via the pipeline.

      • [Console]::OutputEncoding specifies the encoding used to interpret (decode) data FROM an external program('s stdout stream); for decoding to work as intended, this setting must match the external program's actual output encoding.
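      The consequence of a mismatch can be simulated in Ruby (an illustrative sketch, not PowerShell itself): decoding a program's UTF-8 output bytes with the wrong single-byte encoding produces mojibake, which is exactly what happens when [Console]::OutputEncoding doesn't match the external program's actual output encoding.

      ```ruby
      # Sketch: what a mismatched [Console]::OutputEncoding does to UTF-8 output.
      # ISO-8859-1 stands in here for a legacy single-byte OEM code page.
      utf8_bytes = "どうぞ".b  # the raw UTF-8 bytes an external program emits

      garbled  = utf8_bytes.force_encoding("ISO-8859-1").encode("UTF-8")  # wrong decoder
      restored = utf8_bytes.force_encoding("UTF-8")                       # matching decoder
      ```

      Only the matching decoder recovers the original text; the wrong one yields three garbage characters per Japanese character.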

    • As of PowerShell 7.3.1, PowerShell only "speaks text" when communicating with external programs, and an intermediate decoding and re-encoding step is invariably involved - even when you're just using > (effectively an alias of the Out-File cmdlet) to send output to a file.

      • That is, PowerShell's pipelines are NOT raw byte conduits the way they are in other shells.

        • See this answer for workarounds and potential future raw-byte support.
      • Whatever output operator (>) or cmdlet (Out-File, Set-Content) you use applies its own default character encoding, which is unrelated to the encoding of the original input - by the time the operator / cmdlet sees the data, it has already been decoded into .NET strings.

        • > / Out-File in Windows PowerShell defaults to "Unicode" (UTF-16LE) encoding, which is what you saw.

        • While Out-File and Set-Content have an -Encoding parameter that allows you to control the output encoding, in Windows PowerShell they don't allow you to create BOM-less UTF-8 files. Curiously, New-Item does create such files, which is why it is used above. If a UTF-8 BOM is acceptable, ... | Set-Content -Encoding utf8 will do in Windows PowerShell.

        • Note that, by contrast, PowerShell (Core) 7+, the modern, cross-platform edition, now thankfully defaults consistently to BOM-less UTF-8.

          • That said, with respect to [Console]::OutputEncoding on Windows, it still uses the legacy OEM code page by default as of v7.3.1, which means that UTF-8 output from external programs is by default misinterpreted - see GitHub issue #7233 for a discussion.
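    As an aside, the exact garbage in the question can be reproduced in Ruby (a sketch, assuming the UTF-16LE file written by > was later read as single-byte characters): each UTF-16 code unit's two bytes get interpreted as two separate characters.

    ```ruby
    # Sketch: reproducing the question's mojibake. Windows PowerShell's ">"
    # wrote UTF-16LE; reading those bytes as single-byte characters yields
    # the "i0F0^0..." pattern ("ど" = U+3069 -> bytes 0x69 0x30 -> "i0").
    utf16   = "どうぞ".encode("UTF-16LE")
    misread = utf16.force_encoding("ISO-8859-1").encode("UTF-8")
    # misread is "i0F0^0"
    ```

    (The leading � in the question's output is the UTF-16LE byte-order mark, 0xFF 0xFE, similarly misread.)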