How to extract the binary part from a multipart file with PowerShell?


I have a multipart file received from a server, and I need to extract the PDF part from it. I tried removing the first x lines and the last 2 lines with:

$content=Get-Content $originalfile
$content[0..($content.length-3)] |$outfile

but that corrupts the binary data. What is the correct way to extract the binary part from the file? The file looks like this:

MIME-Version: 1.0
Content-Type: multipart/related; boundary=MIME_Boundary; 
    start="<6624867311297537120--4d6a31bb.16a77205e4d.3282>"; 
    type="text/xml"

--MIME_Boundary
Content-ID: <6624867311297537120--4d6a31bb.16a77205e4d.3282>
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 8bit

<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Body xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"/>
--MIME_Boundary
Content-ID: 
Content-Type: application/xml
Content-Disposition: form-data; name="metadata"

<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata><contentLength>64288</contentLength><etag>7e3da21f7ed1b434def94f4b</etag><contentType>application/octet-stream</contentType><properties><property><key>Account</key><value>finance</value></property><property><key>Business Unit</key><value>EU DEBMfg</value></property><property><key>Document Type</key><value>PAYABLES</value></property><property><key>Filename</key><value>test-pdf.pdf</value></property></properties></metadata>
--MIME_Boundary
Content-ID: 
Content-Type: application/octet-stream
Content-Disposition: form-data; name="content"

%PDF-1.6
%âãÏÓ
37 0 obj <</Linearized 1/L 20597/O 40/E 14115/N 1/T 19795/H [ 1005 215]>>
endobj

xref
37 34
0000000016 00000 n
0000001386 00000 n
0000001522 00000 n
0000001787 00000 n
0000002250 00000 n
.
.
.
0000062787 00000 n
0000063242 00000 n
trailer
<<
    /Size 76
    /Prev 116
    /Root 74 0 R
    /Encrypt 38 0 R
    /Info 75 0 R
    /ID [ <C21F21EA44C1E2ED2581435FA5A2DCCE> <3B7296EB948466CB53FB76CC134E3E76> ]
>>
startxref
63926
%%EOF

--MIME_Boundary-

Solution

  • You need to read the file as a series of bytes, that is, treat it as a binary file. Then, to locate the PDF part, read it a second time as a string so you can run a regular expression against it.

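    The string must use an encoding that does not alter the bytes in any way. Code page 28591 (ISO 8859-1) does exactly that: it maps each byte value from 0x00 to 0xFF to the character with the same code point, so the bytes of the original file are preserved as-is and every character offset in the string equals the byte offset in the file.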

    To do this, I've written the following helper function:

    function ConvertTo-BinaryString {
        # converts the bytes of a file to a string that has a
        # 1-to-1 mapping back to the file's original bytes. 
        # Useful for performing binary regular expressions.
        Param (
            [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
            [ValidateScript( { Test-Path $_ -PathType Leaf } )]
            [String]$Path
        )
    
        $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
    
        # Note: Codepage 28591 (ISO 8859-1) returns a 1-to-1 char to byte mapping
        $Encoding     = [Text.Encoding]::GetEncoding(28591)
        $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
        $BinaryText   = $StreamReader.ReadToEnd()
    
        $StreamReader.Close()
        $Stream.Close()
    
        return $BinaryText
    }
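
    As a quick check of the 1-to-1 claim, the sketch below round-trips all 256 byte values through code page 28591 and, for contrast, through UTF-8. It needs no input file and uses only standard .NET calls; the variable names are made up for this illustration.

    ```powershell
    # Build a byte array holding every possible byte value, 0x00 through 0xFF.
    [byte[]]$allBytes = 0..255

    # Code page 28591 (ISO 8859-1): decode the bytes to a string, then encode back.
    $latin1    = [System.Text.Encoding]::GetEncoding(28591)
    $roundTrip = $latin1.GetBytes($latin1.GetString($allBytes))

    # UTF-8 for contrast: bytes that are not valid UTF-8 sequences are replaced
    # during decoding, so the original values cannot be recovered afterwards.
    $utf8    = [System.Text.Encoding]::UTF8
    $viaUtf8 = $utf8.GetBytes($utf8.GetString($allBytes))

    # Compare the arrays by their hex dumps.
    $original = [System.BitConverter]::ToString($allBytes)
    $latin1Ok = [System.BitConverter]::ToString($roundTrip) -ceq $original
    $utf8Ok   = [System.BitConverter]::ToString($viaUtf8) -ceq $original

    "ISO 8859-1 round trip intact: $latin1Ok"
    "UTF-8 round trip intact:      $utf8Ok"
    ```

    This is probably also why the line-based attempt in the question damaged the PDF: a text decoding that is not byte-preserving, combined with the loss of the original line terminators when a file is read line by line, can change the payload.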
    

    Using the above function, you should be able to get the binary part from the multipart file like this:

    $inputFile  = 'D:\blah.txt'
    $outputFile = 'D:\blah.pdf'
    
    # read the file as byte array
    $fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
    # and again as string where every byte has a 1-to-1 mapping to the file's original bytes
    $binString = ConvertTo-BinaryString -Path $inputFile
    
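    # create your regex, all as ASCII byte characters: '%PDF.*%%EOF[\r\n]{0,2}'
    $regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
    $match = $regex.Match($binString)

    # stop early if the file contains no '%PDF ... %%EOF' section
    if (-not $match.Success) { throw "No PDF data found in '$inputFile'" }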
    
    # use a MemoryStream object to store the result
    $stream = New-Object System.IO.MemoryStream
    $stream.Write($fileBytes, $match.Index, $match.Length)
    
    # save the binary data of the match as a series of bytes
    [System.IO.File]::WriteAllBytes($outputFile, $stream.ToArray())
    
    # clean up
    $stream.Dispose()
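
    To try the approach without a real server response, here is a self-contained sketch that builds a tiny synthetic multipart file and extracts the payload again. Several things in it are assumptions for the demo: the temp file names are hypothetical, the payload is a fake (not a valid PDF), the string is produced by decoding the already loaded bytes with GetString rather than by calling the helper function, and the bytes are copied with [Array]::Copy instead of a MemoryStream.

    ```powershell
    # Hypothetical temp file names, used only for this demonstration.
    $demoIn  = Join-Path ([System.IO.Path]::GetTempPath()) 'multipart-demo.txt'
    $demoOut = Join-Path ([System.IO.Path]::GetTempPath()) 'multipart-demo.pdf'

    $latin1 = [System.Text.Encoding]::GetEncoding(28591)

    # Fake payload: PDF header, four non-ASCII bytes, then the %%EOF trailer.
    [byte[]]$payload = $latin1.GetBytes("%PDF-1.6`n%") +
                       [byte[]](0xE2, 0xE3, 0xCF, 0xD3) +
                       $latin1.GetBytes("`n%%EOF`r`n")

    # Wrap the payload in a minimal multipart part and write it to disk.
    [byte[]]$header = $latin1.GetBytes("--MIME_Boundary`r`nContent-Type: application/octet-stream`r`n`r`n")
    [byte[]]$footer = $latin1.GetBytes("`r`n--MIME_Boundary--`r`n")
    [System.IO.File]::WriteAllBytes($demoIn, [byte[]]($header + $payload + $footer))

    # Read the bytes once and decode them 1-to-1 into a string.
    $fileBytes = [System.IO.File]::ReadAllBytes($demoIn)
    $binString = $latin1.GetString($fileBytes)

    $regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
    $match = $regex.Match($binString)

    if ($match.Success) {
        # Character offsets equal byte offsets because each byte became one character.
        $pdfBytes = New-Object byte[] ($match.Length)
        [System.Array]::Copy($fileBytes, $match.Index, $pdfBytes, 0, $match.Length)
        [System.IO.File]::WriteAllBytes($demoOut, $pdfBytes)
    }

    # Check that the extracted file is byte-for-byte identical to the payload.
    $extracted = [System.IO.File]::ReadAllBytes($demoOut)
    $same = [System.BitConverter]::ToString($extracted) -ceq [System.BitConverter]::ToString($payload)
    "Extracted payload identical to original: $same"

    # Remove the demo files again.
    Remove-Item $demoIn, $demoOut
    ```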
    

    Regex details:

    (                 Match the regular expression below and capture its match into backreference number 1
       \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
       \x50           Match the ASCII or ANSI character with position 0x50 (80 decimal => P) in the character set
       \x44           Match the ASCII or ANSI character with position 0x44 (68 decimal => D) in the character set
       \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
       [\x00-\xFF]    Match a single character in the range between ASCII character 0x00 (0 decimal) and ASCII character 0xFF (255 decimal)
          *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
       \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
       \x45           Match the ASCII or ANSI character with position 0x45 (69 decimal => E) in the character set
       \x4F           Match the ASCII or ANSI character with position 0x4F (79 decimal => O) in the character set
       \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
       [\x0D\x0A]     Match a single character present in the list below
                          ASCII character 0x0D (13 decimal)
                          ASCII character 0x0A (10 decimal)
          {0,2}       Between zero and 2 times, as many times as possible, giving back as needed (greedy)
    )
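
    The greedy quantifier matters when the data contains more than one %%EOF marker, which can happen, for example, in PDFs that were saved incrementally. The sketch below runs the same pattern against a made-up sample string (not real PDF data) and compares it with a lazy variant, which is shown only for contrast and is not part of the solution above.

    ```powershell
    # The pattern from the solution: greedy, so it runs up to the LAST %%EOF.
    $greedy = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'

    # Lazy variant for contrast: '*?' stops at the FIRST %%EOF.
    $lazy = [Regex]'(?s)\x25\x50\x44\x46[\x00-\xFF]*?\x25\x25\x45\x4F\x46'

    # Made-up sample with leading text, two %%EOF markers, and a closing boundary.
    $sample = "junk`n%PDF-1.6`nfirst`n%%EOF`nsecond`n%%EOF`n`n--MIME_Boundary--"

    $greedyMatch = $greedy.Match($sample)
    $lazyMatch   = $lazy.Match($sample)

    # The greedy match includes both sections plus up to two trailing CR/LF characters.
    "Greedy match contains 'second': " + $greedyMatch.Value.Contains('second')

    # The lazy match ends at the first marker and would cut the data short.
    "Lazy match contains 'second':   " + $lazyMatch.Value.Contains('second')
    ```

    Note that the trailing {0,2} takes up to two line-break characters after the last %%EOF no matter where they come from, so if the payload itself ends with fewer than two, a character of the line break that precedes the closing boundary can end up in the match.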