
Casting int to chars in PowerShell has unexpected results


I am trying to generate a string with one of every ASCII character. I started with

32..255| %{[char]$_ | Out-File -filepath .\outfile.txt -Encoding ASCII -Append}

I expected the list of printable characters, but I got different characters.

Can anyone point me to either a better way to get my expected result or an explanation as to why I'm getting these results?


Solution

  • [char[]] (32..255) | Set-Content outfile.txt
    

    In Windows PowerShell this will create an "ANSI"-encoded file. "ANSI" is an umbrella term for the set of fixed-width, single-byte, 8-bit encodings on Windows that are supersets of ASCII. The specific "ANSI" encoding in use is implied by the code page associated with the legacy system locale in effect on your system[1]; e.g., Windows-1252 on US-English systems.
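
    If you want to check which "ANSI" code page is in effect on your system, the following sketch works in Windows PowerShell (in PowerShell (Core) 7+, [System.Text.Encoding]::Default reports UTF-8 instead):

    # Windows PowerShell: inspect the active "ANSI" code page.
    [System.Text.Encoding]::Default.CodePage      # e.g., 1252 on US-English systems
    [System.Text.Encoding]::Default.EncodingName  # e.g., "Western European (Windows)"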

    See the bottom section for why "ANSI" encoding should be avoided.

    If you were to do the same thing in PowerShell (Core) 7+, you'd get a UTF-8-encoded file without a BOM, which is the best encoding to use for cross-platform and cross-locale compatibility.

    In Windows PowerShell, adding -Encoding utf8 would give you a UTF-8 file too, but with a BOM.[2]

    If you used -Encoding Unicode or simply used the redirection operator > or Out-File, you'd get a UTF-16LE-encoded file.
    (In PowerShell (Core), by contrast, > produces BOM-less UTF-8 by default, because the latter is the consistently applied default encoding.)
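
    As a sketch of these per-edition defaults (the file name enc.txt is just a placeholder), writing a single non-ASCII character and dumping the raw bytes makes the BOM visible:

    # Windows PowerShell: -Encoding utf8 writes UTF-8 *with* a BOM.
    'é' | Set-Content .\enc.txt -Encoding utf8
    [System.IO.File]::ReadAllBytes("$PWD\enc.txt")
    # -> 239 187 191 195 169 13 10  (EF BB BF BOM, UTF-8 'é', CRLF)
    # In PowerShell (Core) 7+ the default is BOM-less UTF-8, so the same
    # content written there starts directly with 195 169.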

    Note: With strings and numbers, Set-Content and > / Out-File can be used interchangeably (encoding differences in Windows PowerShell aside); for other types, only > / Out-File produces meaningful representations, albeit suitable only for human eyeballs, not programmatic processing - see this answer for more.
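
    To see that difference with a non-string type, compare the following sketch (the output file names are placeholders):

    # Set-Content stringifies each object via .ToString(); Out-File uses the
    # for-display formatting system.
    Get-ChildItem | Set-Content .\files1.txt  # one file/directory name per line
    Get-ChildItem | Out-File .\files2.txt     # the familiar formatted table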

    ASCII code points are limited to 7-bit values, i.e., the range 0x0 - 0x7f (127).

    Therefore, your input values 128 - 255 cannot be represented as ASCII characters, and using -Encoding ASCII causes such unrepresentable characters to be replaced with literal ? characters (code point 0x3F / 63), so information is lost.
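
    You can observe this lossy substitution directly; a minimal sketch (t.txt is a placeholder name):

    # A code point above 127 degrades to '?' (0x3F) with -Encoding ASCII.
    [char] 0xFF | Set-Content .\t.txt -Encoding Ascii
    Get-Content .\t.txt                           # -> ?
    [System.IO.File]::ReadAllBytes("$PWD\t.txt")  # -> 63 13 10 ('?' + CRLF)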


    Important:

    In memory, casting numbers such as 32 (0x20) or 255 (0xFF) to [char] (System.Char) instances causes the numbers to be interpreted as UTF-16 code units, representing Unicode characters[3] such as U+0020 and U+00FF as 2-byte sequences using the native byte order, because that's what characters are in .NET.
    Similarly, instances of the .NET [string] type (System.String) are sequences of such [char] instances.
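
    A quick sketch makes this in-memory picture concrete:

    # A [char] is a UTF-16 code unit; for characters in the 16-bit range its
    # numeric value equals the Unicode code point.
    $c = [char] 0xFF                                       # ÿ (U+00FF)
    [int] $c                                               # -> 255
    [System.Text.Encoding]::Unicode.GetBytes([string] $c)  # -> 255 0 (little-endian)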

    On output to a file or during serialization, re-encoding of these UTF-16 strings may occur, depending on the implied or specified output encoding.

    • If the output encoding is a fixed-width, single-byte encoding, such as ASCII, Default ("ANSI"), or OEM, loss of information may occur, namely if the string to output contains characters that cannot be represented in the target encoding.

    • Choose one of the Unicode-based encoding formats (compared byte for byte in the sketch after this list) to guarantee that:

      • no information is lost,
      • the resulting file is interpreted the same on all systems, irrespective of their system locale.

    The individual formats compare as follows:

      • UTF-8 is the most widely recognized encoding, but note that Windows PowerShell (unlike PowerShell Core) invariably prepends a BOM to such files, which can cause problems on Unix-like platforms and with utilities of Unix heritage; the format is optimized for backward compatibility with ASCII encoding and uses between 1 and 4 bytes to encode a single character.
      • UTF-16LE (which PowerShell calls Unicode) is a direct representation of the in-memory code units, but note that each character is encoded with (at least) 2 bytes, which can result in files up to twice the size of their UTF-8 equivalents for strings composed primarily of characters in the ASCII range.
      • UTF-16BE (which PowerShell calls bigendianunicode) reverses the byte order in each code unit.
      • UTF-32LE (which PowerShell calls UTF32) represents each Unicode character as a fixed 4-byte sequence; even more so than with UTF-16, this typically results in unnecessarily large files.
      • UTF-7 should be avoided altogether, as it is not part of the Unicode standard.
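
    The following sketch compares the byte sequences the formats above produce for the same two-character string:

    # 'Aé' is U+0041 plus U+00E9; byte counts and byte order differ per encoding.
    $s = 'Aé'
    [System.Text.Encoding]::UTF8.GetBytes($s)              # -> 65 195 169 (1 + 2 bytes)
    [System.Text.Encoding]::Unicode.GetBytes($s)           # -> 65 0 233 0 (UTF-16LE)
    [System.Text.Encoding]::BigEndianUnicode.GetBytes($s)  # -> 0 65 0 233 (UTF-16BE)
    [System.Text.Encoding]::UTF32.GetBytes($s)             # -> 65 0 0 0 233 0 0 0 (UTF-32LE)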

    [1] Among the legacy code pages supported on Windows, there are also fixed double-byte as well as variable-width encodings, but only for East Asian locales; sometimes they're (incorrectly) collectively referred to as DBCS (Double-Byte Character Set), as opposed to SBCS (Single-Byte Character Set); see the list of all Windows code pages.

    [2] See this answer for how to create BOM-less UTF-8 files in Windows PowerShell.

    [3] Strictly speaking, a UTF-16 code unit identifies a Unicode code point only for characters in the 16-bit range; some (rare) Unicode characters have a code point value that falls outside the range representable by a 16-bit integer, and such code points must instead be represented by a sequence of two code units, known as a surrogate pair.
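
    A minimal sketch illustrating such a surrogate pair:

    # U+1F600 (a grinning-face emoji) lies outside the 16-bit range and
    # therefore occupies two [char] instances in a .NET string.
    $s = [char]::ConvertFromUtf32(0x1F600)
    $s.Length                  # -> 2 (two UTF-16 code units)
    [int] $s[0]; [int] $s[1]   # -> 55357 (0xD83D), 56832 (0xDE00)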