Tags: powershell, encoding, character-encoding

PowerShell Set-Content encoding


Part of the script looks like this:

$template = Get-Content "./template/temaplate.htm" -raw
$html = $template.Replace('{{imie}}', $imie).Replace('{{nazwisko}}', $nazwisko).Replace('{{stanowisko}}', $stanowisko).Replace('{{mobile}}', $mobile).Replace('{{kapital}}', $kapital).Replace('{{telefon}}', $telefon)
Set-Content -Encoding UTF8 "output/podpis.htm" -Value $html

temaplate.htm contains, for example, the words "Sąd" and "Wrocław", but after running Set-Content all Polish special characters are garbled ("SÄ…d", "WrocĹ‚aw"), and I don't really understand why. The template also sets:

<meta charset="UTF-8">

Solution

  • Your symptom implies:

    • Your file is UTF-8-encoded but doesn't have a BOM.

    • You're using Windows PowerShell, where Get-Content defaults to the system's active ANSI code page, and therefore misinterprets your file:[1]

      • Note that Get-Content does not try to interpret the content of the file, and therefore the presence of <meta charset="UTF-8"> inside it is irrelevant.
        All that matters is whether the file starts with a Unicode BOM (which unequivocally identifies the character encoding) or not (in which case an encoding must be assumed).

      • Using -Encoding utf8 only with Set-Content is then too late, because the misinterpretation has already happened.

    Note that you would not have this problem in PowerShell (Core) 7+, which consistently defaults to (BOM-less) UTF-8.
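To confirm the first point, you can inspect the file's first three bytes. The following is a minimal sketch; `Test-Utf8Bom` is a hypothetical helper name, not a built-in cmdlet, and it reads the whole file into memory, which is fine for a small template.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns $true if the file
# starts with the UTF-8 BOM bytes EF BB BF, otherwise $false.
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string] $Path)

    # Resolve to a full path first: .NET methods use the process's working
    # directory, which can differ from PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and
            $bytes[1] -eq 0xBB -and
            $bytes[2] -eq 0xBF)
}

# Example call, using the path from the question:
# Test-Utf8Bom "./template/temaplate.htm"
```

If this returns `$false` for a file that contains non-ASCII characters, Windows PowerShell has to guess the encoding, and it guesses ANSI.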


    Therefore, use -Encoding utf8 also in your Get-Content call:

    $template = Get-Content -Encoding UTF8 "./template/temaplate.htm" -Raw
    # ...
    Set-Content -Encoding UTF8 "output/podpis.htm" -Value $html
    

    Caveat:

    • In Windows PowerShell, Set-Content -Encoding UTF8 invariably creates a UTF-8 file with BOM. If that is undesired, use New-Item as a workaround:
    # Creates a BOM-less UTF-8 file even in Windows PowerShell.
    New-Item -Force "output/podpis.htm" -Value $html
    

    (Again, in PowerShell (Core) 7+ you wouldn't have that problem: all cmdlets there create BOM-less UTF-8 files by default; -Encoding utf8bom is needed to explicitly request a BOM.)
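Another way to get a BOM-less file in Windows PowerShell is to call .NET directly. This is a sketch under assumptions, not part of the original workaround: the `$html` value is placeholder content and the output location (`podpis.htm` in the current directory) is hypothetical.

```powershell
# Placeholder content for the sketch; in the question's script, $html already
# holds the filled-in template. Code points are used so that this script's own
# encoding doesn't matter.
$html = "<p>S$([char]0x0105)d, Wroc$([char]0x0142)aw</p>"

# .NET resolves relative paths against the process's working directory, not
# PowerShell's current location, so build a full path explicitly.
$outPath = Join-Path (Get-Location).ProviderPath "podpis.htm"   # hypothetical location

# The constructor argument $false means "do not emit a BOM".
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllText($outPath, $html, $utf8NoBom)
```

Unlike Set-Content, WriteAllText writes the string exactly as given and does not append a trailing newline.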

    See this answer for additional information.


    [1] Specifically, each byte in a multi-byte UTF-8 encoding sequence representing a single non-ASCII-range character is misinterpreted as its own character, namely a character from the ANSI character set. You can reproduce this as follows, assuming that Windows-1252 is the active ANSI code page: [Text.Encoding]::GetEncoding(1252).GetString([Text.Encoding]::UTF8.GetBytes('ą')) - this yields Ä…, i.e. two (different) characters, as in your question.
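The footnote's one-liner can be extended to both words from the question. The sketch below decodes their UTF-8 bytes with two ANSI code pages, assuming both are available in your session (they are in Windows PowerShell). The garbled form "WrocĹ‚aw" shown in the question appears consistent with Windows-1250 (Central European) rather than Windows-1252; the mechanism is the same either way.

```powershell
# 'Sąd' and 'Wrocław', built from code points so that this script's own
# encoding cannot interfere with the demonstration.
$words = @(
    "S$([char]0x0105)d",
    "Wroc$([char]0x0142)aw"
)

$utf8 = [System.Text.Encoding]::UTF8

foreach ($codePage in 1250, 1252) {
    $ansi = [System.Text.Encoding]::GetEncoding($codePage)
    foreach ($word in $words) {
        # Encode as UTF-8, then misread each byte as one ANSI character.
        $garbled = $ansi.GetString($utf8.GetBytes($word))
        "{0}: {1} -> {2}" -f $codePage, $word, $garbled
    }
}
```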