
Special characters become question marks after Command line find and replace


I have a text file, input.xlf:

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    &lt;source&gt;Login</source>
    <target>登入</target>
    <note>Login Header</note>
  </trans-unit>

Basically, I need to replace &lt; with < and &gt; with >, so I run the script below:

runner.bat

powershell -Command "(gc input.xlf) -replace '&lt;', '<' | Out-File -encoding ASCII output.xlf";
powershell -Command "(gc output.xlf) -replace '&gt;', '>' | Out-File -encoding ASCII  output.xlf";

This worked until I noticed the following output:

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    <source>Login</source>
    <target>??????</target>
    <note>Login Header</note>
  </trans-unit>

I tried removing the -encoding parameter, but now I get:

 <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
   <source>Login</source>
   <target>登入</target>
   <note>Login Header</note>  
 </trans-unit>

This is my desired output:

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    <source>Login</source>
    <target>登入</target>
    <note>Login Header</note>
  </trans-unit>

Solution

  • There are (potentially) two character-encoding problems:

    • On output, -Encoding Ascii lossily replaces every character outside the ASCII range with a literal ? character.

      • To preserve all characters, you must choose a Unicode encoding, such as -Encoding Utf8.
    • On input, you must ensure that the input file is correctly read by PowerShell.

      • Specifically, Windows PowerShell misinterprets BOM-less UTF-8 files as ANSI-encoded, so you need to use -Encoding Utf8 with Get-Content too.
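
Both problems can be reproduced in memory, without touching any files. The following is a minimal sketch, not part of the original answer: it uses the .NET System.Text.Encoding classes directly, and ISO-8859-1 (code page 28591) is only a stand-in for whatever ANSI code page a given system uses. The variable names are illustrative.

```powershell
# Build the two characters from <target> out of code points, so the snippet
# does not depend on how the script file itself is encoded.
$text = "$([char]0x767B)$([char]0x5165)"

$ascii = [System.Text.Encoding]::ASCII
$utf8  = [System.Text.Encoding]::UTF8
# Assumption: ISO-8859-1 stands in for the system's single-byte ANSI code page.
$ansi  = [System.Text.Encoding]::GetEncoding(28591)

# Output problem: ASCII cannot represent these characters,
# so each one is replaced with '?'.
$asciiDirect = $ascii.GetString($ascii.GetBytes($text))

# Input problem: the UTF-8 bytes (three per character here) are decoded
# one byte at a time, producing six wrong characters instead of two.
$misread = $ansi.GetString($utf8.GetBytes($text))

# Both problems combined: the six misread characters are all non-ASCII,
# so writing them as ASCII yields six question marks.
$asciiAfterMisread = $ascii.GetString($ascii.GetBytes($misread))

"Direct ASCII output : $asciiDirect"
"Misread length      : $($misread.Length)"
"Misread, then ASCII : $asciiAfterMisread"
```

Under these assumptions, the combined case is consistent with the six question marks in the question's output: two characters were read as six, and each of the six was then replaced with ?.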

    You can also get by with a single powershell.exe call, which can be optimized further:

    powershell -Command "(gc -Raw -Encoding utf8 input.xlf) -replace '&lt;', '<' -replace '&gt;', '>' | Set-Content -NoNewLine -Encoding Utf8 output.xlf"
    
    • Using -Raw with gc (Get-Content) reads the file as a whole instead of into an array of lines, which speeds up the -replace operations.

    • You can chain -replace operations, so both substitutions happen in a single expression.

    • With input that is already text (strings), Set-Content is generally the faster choice.[1]
      -NoNewLine prevents an extra trailing newline from getting appended.
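
To check the approach end to end, the same pipeline can be run against a scratch file and the result read back. This is a sketch under stated assumptions rather than part of the original answer: the temp-file names and the sample content are hypothetical, and the cmdlet names are spelled out in full (Get-Content instead of gc).

```powershell
# Hypothetical scratch files, used only for this demonstration.
$tempDir = [System.IO.Path]::GetTempPath()
$inFile  = Join-Path $tempDir 'demo-input.xlf'
$outFile = Join-Path $tempDir 'demo-output.xlf'

# Sample content modeled on the question; the non-ASCII characters are built
# from code points so the script file's own encoding does not matter.
$target = "$([char]0x767B)$([char]0x5165)"
$sample = "<trans-unit>`n  &lt;source&gt;Login</source>`n  <target>$target</target>`n</trans-unit>`n"

# WriteAllText without an encoding argument writes UTF-8 without a BOM,
# which is the kind of input file the answer describes.
[System.IO.File]::WriteAllText($inFile, $sample)

# Same steps as the one-liner: read the whole file as UTF-8, chain both
# replacements, and write UTF-8 without appending a trailing newline.
(Get-Content -Raw -Encoding utf8 $inFile) -replace '&lt;', '<' -replace '&gt;', '>' |
    Set-Content -NoNewline -Encoding utf8 $outFile

# Read the output back as UTF-8 and compare it with the expected text.
$result   = Get-Content -Raw -Encoding utf8 $outFile
$expected = $sample.Replace('&lt;', '<').Replace('&gt;', '>')
$roundTripOk = $result -ceq $expected
"Round trip preserved all characters: $roundTripOk"

Remove-Item $inFile, $outFile
```

The comparison uses -ceq because PowerShell's -eq ignores case for strings, which would hide some kinds of differences.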


    [1] It will make virtually no difference here, given that only a single string is written, but with many input strings (line-by-line output) it may - see this answer.