I am trying to generate strings with 1 of every ASCII character. I started with
```powershell
32..255 | %{ [char]$_ | Out-File -FilePath .\outfile.txt -Encoding ASCII -Append }
```
I expected the list of printable characters, but I got different characters.
Can anyone point me to either a better way to get my expected result or an explanation as to why I'm getting these results?
```powershell
[char[]] (32..255) | Set-Content outfile.txt
```
In Windows PowerShell this will create an "ANSI"-encoded file. "ANSI" is an umbrella term for the set of fixed-width, single-byte, 8-bit encodings on Windows that are supersets of ASCII. The specific "ANSI" encoding used is implied by the code page associated with the legacy system locale in effect on your system[1]; e.g., Windows-1252 on US-English systems.
See the bottom section for why "ANSI" encoding should be avoided.
If you were to do the same thing in PowerShell (Core) 7+, you'd get a UTF-8-encoded file without a BOM, which is the best encoding to use for cross-platform and cross-locale compatibility.
In Windows PowerShell, adding `-Encoding utf8` would give you a UTF-8 file too, but with a BOM.[2]
If you used `-Encoding Unicode` or simply used the redirection operator `>` or `Out-File`, you'd get a UTF-16LE-encoded file.
(In PowerShell (Core), by contrast, `>` produces BOM-less UTF-8 by default, because the latter is the consistently applied default encoding.)
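You can verify these per-cmdlet defaults yourself by writing the same string each way and inspecting the raw bytes; a minimal Windows PowerShell sketch (file names are arbitrary, and the trailing CR LF newline bytes that the cmdlets append are included in the output):

```powershell
'é' | Set-Content ansi.txt                    # "ANSI" (e.g. Windows-1252): é -> single byte E9
'é' | Set-Content utf8bom.txt -Encoding utf8  # UTF-8 with BOM: EF BB BF, then C3 A9
'é' > utf16.txt                               # > / Out-File: UTF-16LE with BOM: FF FE, then E9 00

# Print each file's bytes in hex for comparison:
Get-Item ansi.txt, utf8bom.txt, utf16.txt | ForEach-Object {
  '{0}: {1}' -f $_.Name,
    (([System.IO.File]::ReadAllBytes($_.FullName) | ForEach-Object { '{0:X2}' -f $_ }) -join ' ')
}
```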
Note: With strings and numbers, `Set-Content` and `>` / `Out-File` can be used interchangeably (encoding differences in Windows PowerShell aside); for other types, only `>` / `Out-File` produces meaningful representations, albeit ones suitable only for human eyeballs, not programmatic processing - see this answer for more.
ASCII code points are limited to 7-bit values, i.e., the range `0x0` - `0x7f` (`127`).
Therefore, your input values `128` - `255` cannot be represented as ASCII characters, and using `-Encoding ASCII` results in the invalid input characters getting replaced with literal `?` characters (code point `0x3f` / `63`), resulting in loss of information.
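You can observe the substitution directly; a minimal sketch (the trailing `0D 0A` bytes are the newline that `Set-Content` appends on Windows):

```powershell
[char] 0xFF | Set-Content t.txt -Encoding ASCII     # ÿ (U+00FF) is not ASCII-representable
[System.IO.File]::ReadAllBytes((Get-Item t.txt).FullName)
# The first byte is 63 (0x3F), i.e. a literal '?' - the original character is lost.
```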
Important: In memory, casting numbers such as `32` (`0x20`) or `255` (`0xFF`) to `[char]` (`System.Char`) instances causes the numbers to be interpreted as UTF-16 code units, representing Unicode characters[3] such as `U+0020` and `U+00FF` as 2-byte sequences using the native byte order, because that's what characters are in .NET.
Similarly, instances of the .NET `[string]` type (`System.String`) are sequences of one or more `[char]` instances.
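A quick sketch of what such a cast produces in memory:

```powershell
$c = [char] 0xFF          # U+00FF (ÿ) - a single UTF-16 code unit
[int] $c                  # its numeric value: 255
# The in-memory bytes of that code unit, in little-endian order:
[System.Text.Encoding]::Unicode.GetBytes([string] $c)   # 255, 0 (i.e. FF 00)
```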
On output to a file or during serialization, re-encoding of these UTF-16 strings may occur, depending on the implied or specified output encoding.
If the output encoding is a fixed single-byte encoding, such as `ASCII`, `Default` ("ANSI"), or `OEM`, loss of information may occur, namely if the string to output contains characters that cannot be represented in the target encoding.
Choose one of the Unicode-based encoding formats to guarantee that no information is lost:

- UTF-16LE (`Unicode`) is a direct representation of the in-memory code units, but note that each character is encoded with (at least) 2 bytes, which results in files up to twice the size of UTF-8 files for strings that primarily contain characters in the ASCII range.
- UTF-16BE (`bigendianunicode`) reverses the byte order in each code unit.
- UTF-32LE (`UTF32`) represents each Unicode character as a fixed 4-byte sequence; even more so than with UTF-16, this typically results in unnecessarily large files.

[1] Among the legacy code pages supported on Windows, there are also fixed double-byte as well as variable-width encodings, but only for East Asian locales; sometimes they're (incorrectly) collectively referred to as DBCS (Double-Byte Character Set), as opposed to SBCS (Single-Byte Character Set); see the list of all Windows code pages.
[2] See this answer for how to create BOM-less UTF-8 files in Windows PowerShell.
[3] Strictly speaking, a UTF-16 code unit identifies a Unicode code point, but not every code point by itself is a complete Unicode character: some (rare) Unicode characters have code point values that fall outside the range representable by a single 16-bit integer, and such code points are instead represented by a sequence of 2 other code units, known as a surrogate pair.