I'm trying to create a PowerShell hash table to convert non-ASCII (UTF8) characters to their ASCII look-a-likes.
Here are two hash table entries as examples: 'ñ'='n' and 'Ñ'='N'.
Editor's note: Using both these entries in the same hash table literal (@{ 'ñ'='n'; 'Ñ'='N' }) wouldn't work, because PowerShell uses hash tables with case-insensitive key lookups and therefore considers 'ñ' and 'Ñ' duplicate keys and complains. However, this is incidental to the problem at hand.
The first one works: 'ñ' is 0xc3b1. The second one does not work: 'Ñ' is 0xc391, which PowerShell won't accept. (The problem seems to be that 0x91 is outside the range of an acceptable PowerShell char.)
A simpler example of the problem is:
$c = [convert]::toChar(0x91)
which results in $c getting a value of 0x3f instead of 0x91. So what can I do to get 'Ñ'='N' into the hash table, or a char with a value of 0x91? I've already spent hours reading web pages and experimenting.
Note: By default, PowerShell hashtables, due to using case-insensitive lookups, do not support keys that are mere case variations of one another; therefore, ñ and Ñ - the former being the lowercase version of the latter - cannot both be used as keys - see the bottom section.
In memory, all PowerShell strings are UTF-16 .NET strings, which are capable of representing all Unicode characters, so using characters such as Ñ as keys in hash tables is not a problem.
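A minimal demonstration, entered directly at the prompt so that no file encoding is involved:
$ht = @{ 'Ñ' = 'N' }   # the non-ASCII key poses no problem in memory
$ht['Ñ']               # -> 'N'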
The problem you describe only arises when PowerShell misinterprets source code read from a file, due to assuming the wrong character encoding.
Your symptom suggests that your source code is UTF-8-encoded, but the file doesn't have a BOM, which causes Windows PowerShell (but, fortunately, no longer PowerShell [Core] v6+) to misinterpret the file as encoded based on the system's active legacy ANSI code page (e.g., Windows-1252 on US-English systems), a single-byte encoding.
Make sure that your source-code file is saved as UTF-8 with a BOM[1], and your problem will go away.
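If your editor can't re-save the file that way directly, a sketch along the following lines adds the BOM; the path is just a placeholder for your actual script file, it assumes the file's existing bytes are valid (BOM-less) UTF-8, and the ::new() syntax requires PowerShell v5+:
# Hypothetical path - substitute your own script file.
$path = 'C:\path\to\YourScript.ps1'
# Read the text, interpreting the existing bytes as BOM-less UTF-8 ...
$text = [System.IO.File]::ReadAllText($path, [System.Text.UTF8Encoding]::new($false))
# ... and rewrite the file as UTF-8 *with* a BOM, which Windows PowerShell recognizes.
[System.IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($true))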
What you think are Unicode code points, 0xc3b1 and 0xc391, are in reality the 2-byte UTF-8 encodings (0xc3 0xb1 and 0xc3 0x91) of the true code points corresponding to ñ and Ñ: 0xf1 and 0xd1, respectively.
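You can verify the distinction yourself; this quick check prints the true code point of Ñ and the UTF-8 byte sequence it encodes to:
[int] [char] 'Ñ'                             # -> 209 (0xd1): the Unicode code point
[System.Text.Encoding]::UTF8.GetBytes('Ñ')   # -> 195, 145 (0xc3, 0x91): its UTF-8 encoding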
As for:
[convert]::toChar(0x91)
seemingly not producing a [char] instance with the given code point, 0x91 (decimal 145):
It does, namely in memory, which you can easily verify:
[int] [convert]::toChar(0x91) # -> 145 (0x91)
You'll only get 0x3f - which is a literal ? character (try [char] 0x3f) - if you mistakenly save the in-memory representation with ASCII encoding: since 0x91 is outside the ASCII sub-range of Unicode (which goes from 0x00 to 0x7f), it cannot be represented in the output file, and the substitute character ? is used.
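A minimal sketch of that substitution in action - no file needed, since encoding the char to ASCII bytes shows the same effect:
$c = [convert]::ToChar(0x91)
[int] $c                                              # -> 145 (0x91): intact in memory
[System.Text.Encoding]::ASCII.GetBytes([string] $c)   # -> 63 (0x3f), i.e. '?': lost on ASCII encoding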
Note that PowerShell's hash tables are case-insensitive, so you cannot have keys that are merely case variations of one another:
# !! FAILS
PS> @{ Ñ = 'LATIN CAPITAL LETTER N WITH TILDE'; ñ = 'LATIN SMALL LETTER N WITH TILDE' }
... Duplicate keys 'ñ' are not allowed in hash literals.
You must use the .NET [hashtable] type (System.Collections.Hashtable) directly to create case-sensitive hash tables:
# Create case-SENSITIVE hash table:
$ht = [hashtable]::new()
$ht['ñ'] = 'LATIN SMALL LETTER N WITH TILDE'
$ht['Ñ'] = 'LATIN CAPITAL LETTER N WITH TILDE'
$ht now has 2 entries, and $ht['ñ'] and $ht['Ñ'] retrieve the values case-sensitively.
By contrast, if you had used $ht = @{}, i.e. initialized the hash table as a regular, case-insensitive hash table, you'd only get 1 entry with value 'LATIN CAPITAL LETTER N WITH TILDE', because the 2nd assignment, $ht['Ñ'] = ..., simply updated the case-insensitively looked-up key created by the 1st statement.
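For comparison, a quick sketch of that case-insensitive behavior:
# Regular (case-INSENSITIVE) hash table: the 2nd assignment overwrites the 1st.
$ht2 = @{}
$ht2['ñ'] = 'LATIN SMALL LETTER N WITH TILDE'
$ht2['Ñ'] = 'LATIN CAPITAL LETTER N WITH TILDE'
$ht2.Count    # -> 1
$ht2['ñ']     # -> 'LATIN CAPITAL LETTER N WITH TILDE'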
[1] Alternatively, use a UTF-16 encoding, which invariably uses a BOM; the UTF-16LE form is (erroneously) referred to as Unicode in PowerShell.