Only Convert Valid Bytes with .NET GetBytes Method without creating question marks

I am converting strings that contain weird symbols I don't want to Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this is only about the .NET methods:

    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
    $String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)

This works pretty well, except that the weird symbols don't get removed; instead, question marks are created. For example:

"Helloäöü?→"

becomes

"Helloäöü?????"

What I want is to only convert valid bytes, without creating question marks, so the output will be:

"Helloäöü?"

Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...

Solution

One option is to use a regex-based -replace operation based on named Unicode blocks:

"Helloäöü€?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}–—€‚‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•˜™š›œžŸ]'

Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes:

\p{IsBasicLatin} and \p{IsLatin-1Supplement matches characters that fall into the ISO-8859-1 Unicode subrange, which is mostly the same as Windows-1252, but is missing a few characters.
The explicitly enumerated characters (€...) are those Windows-1252 characters not present in ISO-8859-1 (which therefore have different code points in Unicode than in Windows-1252, namely outside the 8-bit range).
- – and — (en dash and em dash) are placed first, so that they aren't mistaken for describing a range of characters (the .NET regex engine apparently allows their interchangeable use with -, the regular "dash" (ASCII-range hyphen).
- ‚ (single low-9 quotation mark) is doubled in order to escape it, because PowerShell allows its interchangeable use with ' (single quotes) - see also: this answer summarizes all such interchangeable uses allowed in PowerShell.

By replacing all non-matching (^) characters with the (implied) empty string, all non-Windows-1252 characters are effectively removed.

A general caveat:

Due to the use of literal non-ASCII-range characters in the command, be sure that PowerShell interprets your script file's character encoding correctly, which notably means using UTF-8 files with BOM for the benefit of Windows PowerShell - see this answer.
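If you'd rather sidestep the file-encoding caveat altogether, the same character class can be spelled with \uXXXX escapes, which the .NET regex engine understands, so that the script contains ASCII-range characters only. The following is a minimal sketch of that idea; the variable names ($extra, $notCp1252, $sample, $clean) are made up for illustration, and the sample string is built from code points rather than literals:

```powershell
# Windows-1252 characters that lie outside the ISO-8859-1 range, as \uXXXX escapes.
$extra = '\u20AC\u201A\u0192\u201E\u2026\u2020\u2021\u02C6\u2030\u0160\u2039\u0152\u017D' +
         '\u2018\u2019\u201C\u201D\u2022\u2013\u2014\u02DC\u2122\u0161\u203A\u0153\u017E\u0178'

# Negated character class: matches everything that is NOT a Windows-1252 character.
$notCp1252 = '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}' + $extra + ']'

# Sample input: "Hello", a-umlaut (U+00E4), euro sign (U+20AC), rightwards arrow (U+2192).
$sample = 'Hello' + [char]0xE4 + [char]0x20AC + [char]0x2192

# Only the arrow is removed; the umlaut and the euro sign are kept.
$clean = $sample -creplace $notCp1252
```

Because no quote-like or dash-like characters appear literally, no doubling or special ordering is needed in this variant.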

However, your to-and-from-bytes encoding approach can be used with a slight adaptation, which works with any target encoding (without needing to enumerate individual characters, as above):

Using a System.Text.EncoderReplacementFallback instance initialized with the empty string effectively removes all characters that cannot be represented in the target encoding.

$string = "Helloäöü€?→"

$encoding = [System.Text.Encoding]::GetEncoding(
  1252,
  # Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
  [System.Text.EncoderReplacementFallback]::new(''),
  [System.Text.DecoderFallback]::ExceptionFallback # not relevant here
)

$string = $encoding.GetString($encoding.GetBytes($string))
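
Since this approach doesn't depend on the specific code page, it could be wrapped in a small helper. What follows is only a sketch: the function name Remove-UnencodableChar and its parameter names are hypothetical, and the sample string is built from code points so that the snippet doesn't depend on the script file's encoding. The .NET calls are the same ones used above.

```powershell
# Hypothetical helper: removes all characters that the given code page cannot represent.
function Remove-UnencodableChar {
    param(
        [Parameter(Mandatory)] [AllowEmptyString()] [string] $InputString,
        [int] $CodePage = 1252   # Windows-1252 by default
    )
    $encoding = [System.Text.Encoding]::GetEncoding(
        $CodePage,
        # An empty replacement string means unencodable characters are dropped.
        [System.Text.EncoderReplacementFallback]::new(''),
        [System.Text.DecoderFallback]::ExceptionFallback
    )
    # Round trip: string -> bytes (lossy) -> string
    $encoding.GetString($encoding.GetBytes($InputString))
}

# "Hello", a-umlaut (U+00E4), euro sign (U+20AC), rightwards arrow (U+2192)
$sample = 'Hello' + [char]0xE4 + [char]0x20AC + [char]0x2192

$cp1252 = Remove-UnencodableChar $sample                  # arrow removed, euro sign kept
$latin1 = Remove-UnencodableChar $sample -CodePage 28591  # ISO-8859-1: euro sign removed too
```

Comparing the two results also illustrates the difference between Windows-1252 and ISO-8859-1 mentioned earlier: the euro sign survives only with code page 1252.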