I am converting strings containing weird symbols I don't want to Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this is only about the .NET methods:
$bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
$String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)
This works pretty well, except that the weird symbols don't get removed; instead, question marks are created, for example:
"Helloäöü?→"
becomes
"Helloäöü?????"
What I want is to only convert valid bytes, without creating question marks, so the output will be:
"Helloäöü?"
Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...
One option is to use a regex-based -replace operation based on named Unicode blocks:
"Helloäöü€?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}–—€‚‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•˜™š›œžŸ]'
Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes.
\p{IsBasicLatin} and \p{IsLatin-1Supplement} match characters that fall into the ISO-8859-1 Unicode subrange, which is mostly the same as Windows-1252, but is missing a few characters.
The explicitly enumerated characters (€...) are those Windows-1252 characters not present in ISO-8859-1 (which therefore have different code points in Unicode than in Windows-1252, namely outside the 8-bit range).
– and — (en dash and em dash) are placed first, so that they aren't mistaken for describing a range of characters (the .NET regex engine apparently allows their interchangeable use with -, the regular "dash" (ASCII-range hyphen)).
‚ (single low-9 quotation mark) is doubled in order to escape it, because PowerShell allows its interchangeable use with ' (single quote); see also this answer, which summarizes all such interchangeable uses allowed in PowerShell.
By replacing all non-matching (^) characters with the (implied) empty string, all non-Windows-1252 characters are effectively removed.
A general caveat:
However, your to-and-from-bytes encoding approach can be used with a slight adaptation, which works with any target encoding (without needing to enumerate individual characters, such as above):
Using a System.Text.EncoderReplacementFallback instance initialized with the empty string effectively removes all characters that cannot be represented in the target encoding.
$string = "Helloäöü€?→"
$encoding = [System.Text.Encoding]::GetEncoding(
1252,
# Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
[System.Text.EncoderReplacementFallback]::new(''),
[System.Text.DecoderFallback]::ExceptionFallback # not relevant here
)
$string = $encoding.GetString($encoding.GetBytes($string))
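Since the fallback technique isn't tied to code page 1252, it can be wrapped in a small helper. This is a minimal sketch; the function name Remove-UnencodableChar and its -CodePage parameter are hypothetical, chosen here purely for illustration:

```powershell
# Hypothetical helper: removes every character that the given code page
# cannot represent, by round-tripping through bytes with an empty-string
# encoder replacement fallback.
function Remove-UnencodableChar {
  param(
    [Parameter(Mandatory)] [AllowEmptyString()] [string] $Text,
    [int] $CodePage = 1252
  )
  $encoding = [System.Text.Encoding]::GetEncoding(
    $CodePage,
    [System.Text.EncoderReplacementFallback]::new(''),   # drop unencodable chars
    [System.Text.DecoderFallback]::ExceptionFallback     # not relevant here
  )
  $encoding.GetString($encoding.GetBytes($Text))
}

Remove-UnencodableChar "Helloäöü€→"                   # → Helloäöü€
Remove-UnencodableChar "Helloäöü€→" -CodePage 28591   # ISO-8859-1 → Helloäöü
```

Note that € survives with Windows-1252 but not with ISO-8859-1 (code page 28591), which matches the difference between the two encodings described above.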