I have a multipart file received from a server and I need to extract the PDF part from it. I tried removing the first x lines and the last 2 with:
$content=Get-Content $originalfile
$content[0..($content.length-3)] | Set-Content $outfile
but it corrupts the binary data. How can I get the binary part out of the file? The file looks like this:
MIME-Version: 1.0
Content-Type: multipart/related; boundary=MIME_Boundary;
start="<6624867311297537120--4d6a31bb.16a77205e4d.3282>";
type="text/xml"
--MIME_Boundary
Content-ID: <6624867311297537120--4d6a31bb.16a77205e4d.3282>
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 8bit
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Body xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"/>
--MIME_Boundary
Content-ID:
Content-Type: application/xml
Content-Disposition: form-data; name="metadata"
<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata><contentLength>64288</contentLength><etag>7e3da21f7ed1b434def94f4b</etag><contentType>application/octet-stream</contentType><properties><property><key>Account</key><value>finance</value></property><property><key>Business Unit</key><value>EU DEBMfg</value></property><property><key>Document Type</key><value>PAYABLES</value></property><property><key>Filename</key><value>test-pdf.pdf</value></property></properties></metadata>
--MIME_Boundary
Content-ID:
Content-Type: application/octet-stream
Content-Disposition: form-data; name="content"
%PDF-1.6
%âãÏÓ
37 0 obj <</Linearized 1/L 20597/O 40/E 14115/N 1/T 19795/H [ 1005 215]>>
endobj
xref
37 34
0000000016 00000 n
0000001386 00000 n
0000001522 00000 n
0000001787 00000 n
0000002250 00000 n
.
.
.
0000062787 00000 n
0000063242 00000 n
trailer
<<
/Size 76
/Prev 116
/Root 74 0 R
/Encrypt 38 0 R
/Info 75 0 R
/ID [ <C21F21EA44C1E2ED2581435FA5A2DCCE> <3B7296EB948466CB53FB76CC134E3E76> ]
>>
startxref
63926
%%EOF
--MIME_Boundary-
You need to treat the file as binary and read it as a series of bytes. To locate the PDF part, you also need the same content as a String, so that you can run a regular expression on it.
The String must use an encoding that does not alter the bytes in any way. Codepage 28591 (ISO 8859-1) does exactly that: every byte value from 0x00 to 0xFF maps to the character with the same code point, so a character position in the string equals the byte offset in the original file.
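The 1-to-1 mapping can be checked with a small sketch. It is not part of the extraction itself; the variable names are made up for illustration, and the UTF-8 comparison is only there to show what a lossy encoding does to the same bytes.

```powershell
# All 256 possible byte values, 0x00 through 0xFF
$bytes = [byte[]](0..255)

# ISO 8859-1 (codepage 28591): each byte becomes the char with the same code point
$latin1    = [Text.Encoding]::GetEncoding(28591)
$text      = $latin1.GetString($bytes)
$roundTrip = $latin1.GetBytes($text)

# One char per byte, and converting back yields the original bytes
$latin1Ok = ($text.Length -eq $bytes.Length) -and
            ([Convert]::ToBase64String($roundTrip) -eq [Convert]::ToBase64String($bytes))

# For comparison: UTF-8 replaces invalid byte sequences, so the bytes do not survive
$utf8     = [Text.Encoding]::UTF8
$utf8Back = $utf8.GetBytes($utf8.GetString($bytes))
$utf8Ok   = [Convert]::ToBase64String($utf8Back) -eq [Convert]::ToBase64String($bytes)

"ISO 8859-1 round trip intact: $latin1Ok"
"UTF-8 round trip intact:      $utf8Ok"
```

Reading binary data as text with a lossy encoding, and writing that text back out, is the kind of step that can damage a PDF.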
To read a file into such a string, I've written the following helper function:
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )

    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'

    # Note: Codepage 28591 (ISO 8859-1) returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText   = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    return $BinaryText
}
Using the above function, you should be able to get the binary part from the multipart file like this:
$inputFile  = 'D:\blah.txt'
$outputFile = 'D:\blah.pdf'

# read the file as byte array
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
# and again as string where every byte has a 1-to-1 mapping to the file's original bytes
$binString = ConvertTo-BinaryString -Path $inputFile

# create your regex, all as ASCII byte characters: '%PDF[\x00-\xFF]*%%EOF[\r\n]{0,2}'
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match = $regex.Match($binString)
# stop here if the file contains no PDF part; otherwise an empty file would be written
if (-not $match.Success) { throw "No PDF data found in '$inputFile'" }

# use a MemoryStream object to store the result
$stream = New-Object System.IO.MemoryStream
# the match position and length in the string are also the byte offset and count
$stream.Write($fileBytes, $match.Index, $match.Length)

# save the binary data of the match as a series of bytes
[System.IO.File]::WriteAllBytes($outputFile, $stream.ToArray())

# clean up
$stream.Dispose()
Regex details:
(                 Match the regular expression below and capture its match into backreference number 1
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x50           Match the ASCII or ANSI character with position 0x50 (80 decimal => P) in the character set
   \x44           Match the ASCII or ANSI character with position 0x44 (68 decimal => D) in the character set
   \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
   [\x00-\xFF]    Match a single character in the range between character 0x00 (0 decimal) and character 0xFF (255 decimal)
      *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x45           Match the ASCII or ANSI character with position 0x45 (69 decimal => E) in the character set
   \x4F           Match the ASCII or ANSI character with position 0x4F (79 decimal => O) in the character set
   \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
   [\x0D\x0A]     Match a single character present in the list below
                     ASCII character 0x0D (13 decimal)
                     ASCII character 0x0A (10 decimal)
      {0,2}       Between zero and 2 times, as many times as possible, giving back as needed (greedy)
)
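To see how the match position maps back to byte offsets, here is a minimal sketch that runs the same regex on a small in-memory sample instead of a real file. The sample content and the variable names ($sampleText, $sampleBytes, $pdfBytes) are made up for illustration; only the regex and the codepage come from the code above.

```powershell
$latin1 = [Text.Encoding]::GetEncoding(28591)

# Hypothetical sample: a part header, a tiny fake "PDF" holding two non-ASCII
# bytes (0xE2, 0xE3), and a closing boundary line
$sampleText  = "Content-Type: application/octet-stream`r`n`r`n" +
               "%PDF-1.6`n%" + [char]0xE2 + [char]0xE3 + "`n%%EOF" +
               "`r`n--MIME_Boundary--"
$sampleBytes = $latin1.GetBytes($sampleText)

# Same steps as above: bytes -> 1-to-1 string -> regex match
$binString = $latin1.GetString($sampleBytes)
$regex     = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match     = $regex.Match($binString)

# Because one char equals one byte, Index and Length select the same range
# in the byte array as they do in the string
$last     = $match.Index + $match.Length - 1
$pdfBytes = [byte[]]($sampleBytes[$match.Index..$last])

"Match starts at byte offset $($match.Index) and spans $($match.Length) bytes"
"First bytes: " + (($pdfBytes[0..3] | ForEach-Object { '{0:X2}' -f $_ }) -join ' ')
```

Note that, because of the trailing [\x0D\x0A]{0,2}, the match in this sample also takes in the CR LF that precedes the boundary line.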