
How can I substring a specific number of bytes from a string in PowerShell?


I need to obtain an installer that is base64-encoded and embedded in a JSON REST response. Because the JSON string is large (180 MB), decoding the REST response with standard PowerShell tooling frequently throws OutOfMemoryException in limited-memory scenarios (such as hitting WinRM memory quotas).

It's not desirable to raise the memory quota in our environment for a single installation, and we don't have standard tooling to prepare a package whose payload is not available at a simple HTTP endpoint (I don't have direct permission to publish packages outside of our build system). My solution in this case is to decode the base64 string in chunks. I have this working, but I am stuck on one last bit of optimization for this process.


Currently I am using a MemoryStream to read from the string, but I need to provide a byte[]:

# $Base64String is a [ref] type
$memStream = [IO.MemoryStream]::new([Text.Encoding]::UTF8.GetBytes($Base64String.Value))

This unsurprisingly copies the entire base64-encoded string into a byte[], which makes it even less memory-efficient than the built-in tooling in its current form. The code you don't see here reads from $memStream 1024 bytes at a time, decodes the base64 text, and writes the resulting bytes to disk using BinaryWriter. This all works, albeit slowly since I'm forcing garbage collection fairly often. However, I want to extend this byte-counting to the initial MemoryStream and only read n bytes from the string at a time. My understanding is that base64 strings must be decoded in chunks whose size is divisible by 4.

The problem is that [string].Substring([int], [int]) works based on string length, not number of bytes per character. The JSON response can be assumed to be UTF-8 encoded, but even with this assumption UTF-8 characters vary between 1-4 bytes in length. How can I (directly or indirectly) substring a specific number of bytes in PowerShell so I can create the MemoryStream from this substring instead of the full $Base64String?
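To illustrate the mismatch between character count and byte count, here is a minimal sketch. The sample strings are made up for illustration, and the non-ASCII characters are built from code points so the result does not depend on how the script file is saved.

```powershell
# Illustrative samples only; none of these come from the actual REST response.
$samples = @(
  'QUJD'                  # base64-style text: ASCII only, 1 byte per character
  "h$([char]0xE9)llo"     # e-acute (U+00E9) takes 2 bytes in UTF-8
  "$([char]0x20AC)100"    # euro sign (U+20AC) takes 3 bytes in UTF-8
)

# Compare what Substring counts (characters) with what a byte[] would hold (bytes)
$report = foreach ($s in $samples) {
  [pscustomobject]@{
    Text      = $s
    Chars     = $s.Length
    Utf8Bytes = [Text.Encoding]::UTF8.GetByteCount($s)
  }
}
$report
```

The first sample hints at a special case: the base64 alphabet is plain ASCII, so for a string that contains only base64 text the two counts should coincide.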

I will note that I have explored the [Text.Encoding].GetBytes([string], [int], [int]) overload, but it has the same issue: the method expects a character count, not a byte count, for the portion of the string to convert starting at the given index.


Solution

  • To answer the base question "How can I substring a specific number of bytes from a string in PowerShell", I was able to write the following function:

    function Get-SubstringByByteCount {
      [CmdletBinding()]
      Param(
        [Parameter(Mandatory)]
        [ValidateScript({ $null -ne $_ -and $_.Value -is [string] })]
        [ref]$InputString,
        [int]$FromIndex = 0,
        [Parameter(Mandatory)]
        [int]$ByteCount,
        [ValidateScript({ [Text.Encoding]::$_ })]
        [string]$Encoding = 'UTF8'
      )
      
      [long]$byteCounter = 0
      # Start scanning at the character index given by -FromIndex
      [int]$i = $FromIndex
      [System.Text.StringBuilder]$sb = New-Object System.Text.StringBuilder $ByteCount
    
      try {
        # Append characters until the byte budget is reached or the string ends
        while ( $byteCounter -lt $ByteCount -and $i -lt $InputString.Value.Length ) {
          [char]$char = $InputString.Value[$i++]
          [void]$sb.Append($char)
          $byteCounter += [Text.Encoding]::$Encoding.GetByteCount($char)
        }
    
        $sb.ToString()
      } finally {
        if( $sb ) {
          $sb = $null
          [System.GC]::Collect()
        }
      }
    }
    

    Invocation works like so:

    Get-SubstringByByteCount -InputString ( [ref]$someString ) -ByteCount 8
    

    Some notes on this implementation:

    • Takes the string as a [ref] type since the original goal was to avoid copying the full string in a limited-memory scenario. This function could be re-implemented using the [string] type instead.
    • This function essentially adds each character to a StringBuilder until the specified number of bytes has been reached. Because a character is appended before its bytes are counted, the result can exceed -ByteCount by up to one character's width when a multi-byte character straddles the limit.
    • The number of bytes of each character is determined by using one of the [Text.Encoding].GetByteCount overloads. Encoding can be specified via a parameter, but the value must match the name of one of the static encoding properties available from [Text.Encoding]. Defaults to UTF8 as written.
    • $sb = $null and [System.GC]::Collect() are intended to forcibly clean up the StringBuilder in a memory-constrained environment, but could be omitted if this is not a concern.
    • -FromIndex takes the start position within -InputString to begin the substring operation from. Defaults to 0 to evaluate from the start of the -InputString.
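
For the base64 case from the question specifically, a simpler route may be possible. The sketch below assumes the string holds only base64 text with no whitespace or line breaks, so every character is one ASCII byte and plain Substring offsets line up with byte offsets. Convert-Base64StringToFile and its parameter names are hypothetical and not part of the function above. Treat it as a starting point under those assumptions rather than a tested drop-in.

```powershell
# Hypothetical helper: decodes a base64 string to a file one chunk at a time.
function Convert-Base64StringToFile {
  [CmdletBinding()]
  Param(
    [Parameter(Mandatory)]
    [ref]$Base64String,
    [Parameter(Mandatory)]
    [string]$Path,          # use a full path; .NET resolves relative paths against the process directory
    [ValidateScript({ $_ -gt 0 -and $_ % 4 -eq 0 })]
    [int]$ChunkChars = 4096 # must be a multiple of 4 so each chunk is valid base64 on its own
  )

  $text = $Base64String.Value   # copies the reference only, not the string contents
  $stream = [IO.File]::Create($Path)
  try {
    for ($offset = 0; $offset -lt $text.Length; $offset += $ChunkChars) {
      # The final chunk may be shorter; it carries any '=' padding
      $count = [Math]::Min($ChunkChars, $text.Length - $offset)
      $bytes = [Convert]::FromBase64String($text.Substring($offset, $count))
      $stream.Write($bytes, 0, $bytes.Length)
    }
  } finally {
    $stream.Dispose()
  }
}

# Usage with a small made-up payload
$payload = [Convert]::ToBase64String([byte[]](0..255))
$target  = [IO.Path]::GetTempFileName()
Convert-Base64StringToFile -Base64String ([ref]$payload) -Path $target -ChunkChars 8
```

Only one chunk-sized substring and its decoded bytes are alive at a time, so peak memory beyond the original string should stay close to the chunk size.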