How to convert String System Object into a Byte System Array in powershell?


I would like to create a binary blob from a binary character string, in the same way as when reading a binary blob from a file into a buffer using a .NET file stream. Then I would like to read 2 bytes from a particular offset in the blob.

I create a file like this:

echo "AAAABBBB" > .\zzblob.txt 
$bytes = "AAAABBBB`r`n"
$aa = [system.bitconverter]::touint16($bytes, 0)

# FAIL!

# Checking the type:
$bytes.GetType() | select Name, BaseType | ft -HideTableHeaders

# String System.Object

Now, doing the same using a stream buffer, we get something else.

$fp = ".\zzblob.txt"
$bf = (new-object byte[](256))
$sp = New-Object System.IO.FileStream($fp, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read)
$sp.Length
$sp.Read($bf, 0, 256)
$sp.close()

$aa = [system.bitconverter]::touint16($bf, 2)   # ..AA
d2h $aa
# 0x4141  ## OK!

# Checking type:
$bf.GetType() | select Name, BaseType | ft -HideTableHeaders

# Byte[] System.Array

How can I convert a string from String System.Object to Byte[] System.Array?


Solution

    • You cannot convert an arbitrary .NET string to bytes without choosing a specific character encoding that should be applied to it.

      • The reason is that different character encodings use different byte representations of characters, notably with respect to the number of bytes required to encode a single character, which can even vary from character to character, as is the case with UTF-8.

      • Whoever must interpret the resulting byte array as a string again must then use the same encoding for de-coding.

    • If all the characters in a given string happen to fall into the 8-bit subrange of Unicode code points, i.e. the 256 characters occupying the Unicode code points from 0x0 to 0xFF (255) (in Unicode terms: U+0000 to U+00FF), you can use a shortcut, assuming that you want to use the Unicode code points as byte values:

      • Use [byte[]] [char[]] $string (or [byte[]] $string.ToCharArray()), as also shown in js2010's answer:

         $string = 'AAAABBBB'
        
         # Convert TO a byte array.
         $byteArray = [byte[]] [char[]] $string
         # OR:
         #    $byteArray = [byte[]] $string.ToCharArray()
        
         # Convert back FROM a byte array.
         [string]::new($byteArray) # [char[]] cast optional
         # OR, more PowerShell-idiomatically, but less efficiently:
         #    -join [char[]] $byteArray
        
        • Caveat: Any character outside that range, i.e. one with a code point of U+0100 (256) or above, e.g. € (EURO SIGN, U+20AC), breaks this approach, because its code point is by definition too large to fit into a [byte] instance:

           # -> ERROR: 
           #   Cannot convert value "€" to type "System.Byte". 
           #   Error: "Value was either too large or too small for an unsigned byte."
           [byte[]] [char[]] '€'
          
      • This approach is tantamount to choosing the fixed-width, single-byte ISO-8859-1 character encoding for the byte representation, because the 8-bit subrange of Unicode coincides with this encoding.

        • That is, the equivalents of the above operations are (note that in PowerShell (Core) 7 you can more simply use [Text.Encoding]::Latin1 in lieu of [Text.Encoding]::GetEncoding(28591)):

           $string = 'AAAABBBB'
          
           # Convert TO a byte array.
           $byteArray = [Text.Encoding]::GetEncoding(28591).GetBytes($string)    
           # Equivalent of:
           #    $byteArray = [byte[]] [char[]] $string
          
           # Convert back FROM a byte array.
           [Text.Encoding]::GetEncoding(28591).GetString($byteArray)
           # Equivalent of:
           #    [string]::new($byteArray)
          
      • As for writing the byte representations to a file:

        • If you have an in-memory byte representation, it is safest to write to and read from files as bytes rather than via a character encoding:

            $string = 'AAAABBBB'
            $byteArray = [byte[]] [char[]] $string
          
            # NOTE: Sadly, the syntax for requesting byte processing differs
            #       between Windows PowerShell and PowerShell 7
            #       (-Encoding Byte vs. -AsByteStream), so we construct an
            #       edition-specific hashtable to be used for splatting below.
            $encodingArg = if ($IsCoreClr) { @{ AsByteStream = $true } } 
                           else            { @{ Encoding = 'Byte' } }
          
            # WRITE the byte array to a file.
            Set-Content blob.txt @encodingArg -Value $byteArray 
          
            # READ the byte array from a file, as such.
            # Note: -Raw -ReadCount 0 reads the entire file *at once* into
            #       a [byte[]] array.
            $byteArrayFromFile = 
              Get-Content blob.txt @encodingArg -Raw -ReadCount 0
          
        • Alternatively, in PowerShell (Core) 7, you can use -Encoding Latin1[1] with Set-Content and Get-Content to directly write and read 8-bit-Unicode range strings, but that doesn't work in Windows PowerShell, where you'd have to use .NET APIs directly.
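Applied to the scenario in the question, the conversion above might be combined with [BitConverter] as in the following minimal sketch. It assumes that every character in the string falls into the U+0000 to U+00FF range; the variable names are illustrative only.

```powershell
# Sketch: build a byte array from a string whose characters all lie in the
# U+0000..U+00FF range, then read a 16-bit value at a given byte offset.
$string = "AAAABBBB`r`n"

# ISO-8859-1 (code page 28591) maps each such character to exactly one byte.
$bytes = [Text.Encoding]::GetEncoding(28591).GetBytes($string)

# Read 2 bytes starting at offset 2: 'A', 'A' = 0x41, 0x41.
$value = [BitConverter]::ToUInt16($bytes, 2)
'0x{0:X4}' -f $value    # -> 0x4141

# The byte order matters once the two bytes differ; this property reports
# which order [BitConverter] uses on the current machine.
[BitConverter]::IsLittleEndian
```

Because both bytes at offset 2 are identical, the result is the same regardless of byte order; for offsets that span different byte values, the result depends on the machine's endianness.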


    [1] The ISO-8859-1 encoding that -Encoding Latin1 refers to is closely related to, but not identical to Windows-1252, so using the latter - which -Encoding Default may refer to in Windows PowerShell, depending on the system locale (e.g. on US-English and Western European machines) - is not an option.
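
The difference mentioned in the footnote can be illustrated with a single byte from the 0x80 to 0x9F range, where the two encodings diverge. This is a sketch only: it assumes that code page 1252 is available in the session via [Text.Encoding]::GetEncoding(), which is the case in Windows PowerShell and is expected, but not guaranteed here, in PowerShell (Core) 7.

```powershell
# Sketch: byte 0x80 decodes differently under ISO-8859-1 and Windows-1252.
$latin1 = [Text.Encoding]::GetEncoding(28591)   # ISO-8859-1
$cp1252 = [Text.Encoding]::GetEncoding(1252)    # Windows-1252 (assumed available)

$byte = [byte[]] 0x80

# ISO-8859-1 maps each byte to the code point with the same numeric value.
'U+{0:X4}' -f [int] $latin1.GetString($byte)[0]   # -> U+0080 (a control character)

# Windows-1252 assigns printable characters to 0x80-0x9F; 0x80 is the euro sign.
'U+{0:X4}' -f [int] $cp1252.GetString($byte)[0]   # -> U+20AC
```

Round-tripping bytes through Windows-1252 therefore does not preserve the "code point equals byte value" property that the [byte[]] [char[]] shortcut relies on.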