python docx embedding bin msg

How to convert an embedded .bin file from a Word document to its original .msg format?


I'm currently putting together some code to extract a variety of files that are embedded in a Word document using Python, but I'm having particular trouble figuring out how to restore an embedded Outlook .msg file back to its original (usable) .msg form after extracting it as an oleObject.bin file. Does anyone have an idea how to do this?

It's pretty straightforward to restore PDF files, and the zipfile library has built-in tools to deal with zip files in .bin form, but I'm really scratching my head on these .msg files. I can't find a way to carve out the original file from all the added binary data. Any help or thoughts on this would be appreciated!

I essentially want to do the same thing as this question but for .msg files instead of PDFs: How can I decode a .bin into a .pdf

Edit: This is the error I get when I try to just rename the file extension of the .bin to .msg


Solution

  • OLE objects, if correctly embedded (not linked), are essentially the same as their source files, so you can open them in their own application and save them from there. An embedded text file can be saved from Notepad. A zip does not need "Save As": it opens as a folder, so it only needs to be moved from its temporary location. An MSG can be saved from Outlook, if you trust it enough to open it.


    If you don't have Outlook, it can be opened in Notepad too (but it will only be salvageable as plain text, and as RTF if that is included). Here we see the Fax Sample entry from Me to You with the complimentary message Hello World!

    If we save the RTF, we can view the RTF body content in WordPad (and thus auto-print it to PDF using Write /PT ....)

    If you want to pull out all the .bin files, use tar -xf to unpack the .docx. They are in the word\embeddings folder:

    hello - docx.zip\word\embeddings
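
Since a .docx is an ordinary zip archive, the same unpacking step can be sketched in Python with the standard zipfile module. This is only a minimal sketch: the function name extract_embeddings and the output folder are illustrative assumptions, and it copies out whatever sits under word/embeddings/ without interpreting it.

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every file under word/embeddings/ out of a .docx (a zip archive).

    Returns the list of paths written. The function name and the flat
    out_dir layout are illustrative assumptions, not part of any library.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as docx:
        for name in docx.namelist():
            # Zip member names always use forward slashes, even on Windows.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(docx.read(name))
                written.append(target)
    return written
```

Calling extract_embeddings("hello.docx", "bins") would leave files such as oleObject1.bin in the bins folder, still with their header and trailer attached.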

    As you observed from the other question, these will include headers and trailers. Of course, you will not know which file is which without looking inside, and the header/trailer still has to be removed, but a zip will start with PK.

    A .msg will start with the DocFile signature.

    The start of an MSG file is marked with ÐÏ à,
    which in hex is D0 CF 11 E0, i.e. it is a "DocFile".
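
To tell the extracted bins apart without opening each one in a hex viewer, the leading bytes can be compared against the signatures mentioned here. The sketch below is hedged: sniff and SIGNATURES are made-up names, the table only covers the types discussed in this thread, and a DocFile match only says the data is an OLE compound file, a container that .msg shares with other Office formats.

```python
# Leading "magic" bytes for the container types discussed above.
# SIGNATURES and sniff are hypothetical names used for illustration.
SIGNATURES = {
    b"PK": "zip",                    # zip archives start with PK
    b"\xd0\xcf\x11\xe0": "docfile",  # D0 CF 11 E0: OLE compound file (.msg and others)
    b"%PDF": "pdf",                  # PDF files start with %PDF
}


def sniff(data):
    """Return a rough type label based on the first bytes of data."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return "unknown"
```

For a file on disk, reading the first few bytes is enough, for example sniff(open(path, "rb").read(8)).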

    The end of an MSG has 16-bit FEFF FFFF ... padding, so it ends with something like
    þÿÿÿýÿÿÿÿÿÿÿÿ ...lots more ÿÿ ... ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
    The bin has more data after that, so the end of that block is dirty with the 16-bit filename and path:
    ÿÿÿÿÿÿÿÿT C : \ U s e r s \ n a m e \ A p p D a t a \ L o c a l \ T e m p \ { A 0 9 5 A 1 6 4 - 2 B 3 6 - 4 9 0 5 - A 2 9 4 - E 5 B C C B 9 5 B 9 B 5 } \ H e l l o ( 2 ) . m s g H e l l o . m s g C : \ U s e r s \ n a m e \ D o c u m e n t s \ H e l l o . m s g

    I am unsure whether the T is significant in some cases or just buffer debris, so you need to check.
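
To check what the trailing block of a particular bin contains, one option is to list the 16-bit (UTF-16LE) text runs together with their offsets. This is only an inspection aid under stated assumptions: utf16_strings is a hypothetical helper, it only recognises printable ASCII characters stored as two bytes each, and it does not carve anything out by itself.

```python
import re


def utf16_strings(data, min_chars=4):
    """Find runs of printable ASCII stored as UTF-16LE (char byte + zero byte).

    Returns a list of (offset, text) pairs. The helper name and the
    min_chars default are illustrative assumptions.
    """
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [
        (match.start(), match.group().decode("utf-16-le"))
        for match in pattern.finditer(data)
    ]
```

Running it over the last few kilobytes of the bin should show the temporary path and original file name, and the offset of the first hit gives a rough idea of where the padding stops and the trailer begins.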