Only Convert Valid Bytes with .NET GetBytes Method without creating question marks


I am converting strings that contain weird symbols I don't want to Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this question is only about the .NET methods:

    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
    $String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)

This works pretty well, except that the weird symbols don't get removed; question marks are created instead, for example:

"Helloäöü?→"

becomes

"Helloäöü?????"

What I want is to only convert valid bytes, without creating question marks, so the output will be:

"Helloäöü?"

Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...


Solution

  • One option is to use a regex-based -replace operation based on named Unicode blocks:

    "Helloäöü€?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}–—€‚‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•˜™š›œžŸ]'
    

    Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes. In the regex above:

    • \p{IsBasicLatin} and \p{IsLatin-1Supplement} match characters that fall into the ISO-8859-1 Unicode subrange, which is mostly the same as Windows-1252, but is missing a few characters.

    • The explicitly enumerated characters (€...) are those Windows-1252 characters not present in ISO-8859-1 (which therefore have different code points in Unicode than in Windows-1252, namely outside the 8-bit range).

      • – and — (en dash and em dash) are placed first, so that they aren't mistaken for describing a range of characters (the .NET regex engine apparently allows their interchangeable use with -, the regular "dash" (ASCII-range hyphen)).
      • ‚ (single low-9 quotation mark) is doubled in order to escape it, because PowerShell allows its interchangeable use with ' (single quotes) - see also this answer, which summarizes all such interchangeable uses allowed in PowerShell.

    By replacing all non-matching (^) characters with the (implied) empty string, all non-Windows-1252 characters are effectively removed.

    A general caveat:

    • Due to the use of literal non-ASCII-range characters in the command, be sure that PowerShell interprets your script file's character encoding correctly, which notably means using UTF-8 files with BOM for the benefit of Windows PowerShell - see this answer.
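
    The explicitly enumerated characters in the regex can be cross-checked without typing any non-ASCII literals. The following is only a sketch, assuming that .NET's code page 1252 decodes each byte in the 0x80-0x9F range either to a character beyond U+00FF or, for the few bytes that Windows-1252 leaves undefined, to the control character with the same code point. The variable names are illustrative:

    ```powershell
    $cp1252 = [System.Text.Encoding]::GetEncoding(1252)

    # Decode each byte in 0x80..0x9F, the only range where Windows-1252
    # and ISO-8859-1 differ, and keep the characters beyond U+00FF.
    $extraChars = foreach ($byte in 0x80..0x9F) {
      $char = $cp1252.GetString([byte[]] $byte)[0]
      if ([int] $char -gt 0xFF) { $char }
    }

    # Join them into one string for comparison with the list in the regex.
    $extraString = -join $extraChars
    $extraString
    ```

    The resulting string can be compared against the characters listed in the character class; the order differs, because the regex deliberately places the two dashes first.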

    However, your to-and-from-bytes encoding approach can be used with a slight adaptation, which works with any target encoding (without needing to enumerate individual characters, such as above):

    Using a System.Text.EncoderReplacementFallback instance initialized with the empty string effectively removes all characters that cannot be represented in the target encoding.

    $string = "Helloäöü€?→"
    
    $encoding = [System.Text.Encoding]::GetEncoding(
      1252,
      # Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
      [System.Text.EncoderReplacementFallback]::new(''),
      [System.Text.DecoderFallback]::ExceptionFallback # not relevant here
    )
    
    $string = $encoding.GetString($encoding.GetBytes($string))
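
    For repeated use, the fallback-based approach can be wrapped in a small helper. This is a minimal sketch: the function name Remove-UnencodableChar and its parameters are hypothetical, not part of .NET or PowerShell, and the sample string is built from code points so that the script file's own character encoding doesn't matter:

    ```powershell
    # Hypothetical helper: removes all characters that the given code page
    # cannot represent, using an empty-string encoder replacement fallback.
    function Remove-UnencodableChar {
      param(
        [Parameter(Mandatory)] [AllowEmptyString()] [string] $InputString,
        [int] $CodePage = 1252
      )
      $encoding = [System.Text.Encoding]::GetEncoding(
        $CodePage,
        [System.Text.EncoderReplacementFallback]::new(''),
        [System.Text.DecoderFallback]::ExceptionFallback
      )
      $encoding.GetString($encoding.GetBytes($InputString))
    }

    # 'Hello', then U+00E4, U+20AC (euro sign), and U+2192 (rightwards arrow)
    $sample = 'Hello' + [char] 0xE4 + [char] 0x20AC + [char] 0x2192
    $cleaned = Remove-UnencodableChar -InputString $sample
    $cleaned
    ```

    With the default code page 1252, only the arrow should be dropped, since the other characters are representable. Because the code page is a parameter, the same helper could be pointed at another encoding, e.g. -CodePage 28591 for ISO-8859-1, in which case the euro sign would be removed as well.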